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Preface 



The First European Business Intelligence Summer School (eBISS 2011) 
attracted experts from academia and industry as well as PhD students inter- 
ested in foundational and applicational aspects of business intelligence (BI) . 
This volume contains the lecture notes of the summer school, which took 
place in Paris, France, in July 2011. 

The first chapter reviews consolidated results in data warehouses and 
describes open research fields. The latter include the need to cope with more 
complex data, both in structure and semantics, and keeping up with the de- 
mands of new application domains such as Web, financial, manufacturing, 
life sciences, multimedia, and spatiotemporal applications. 

The second chapter delves into the issue of data warehouse performance. 
It reviews three types of data structures, namely, indexes, materialized views, 
and partitioned tables, which are essential for efficiently answering analyti- 
cal queries. The chapter also shows how these techniques are applied when 
executing star queries in three commercial data warehouse systems. 

The third chapter shows how popular user-centric techniques, namely, 
personalization and recommendation, can be applied to OLAP queries. The 
chapter characterizes the approaches proposed in this respect in terms of for- 
mulation effort, prescriptiveness, proactiveness, expressiveness, and in terms 
of the data leveraged. 

The fourth chapter discusses the issue of supporting analytical queries over 
Web-based textual content. This problem lies at the intersection of the do- 
mains of search engines and BI. The chapter shows three recent enabling 
technologies in this respect, namely, cloud computing, self-supervised key- 
word generation, and fact extraction. The chapter ends by describing the 
GoOLAP system, a platform supporting Web-scale business analytics. 

The fifth chapter provides an overview of Business Intelligence 2.0, an 
extension of traditional BI based on the evolution of the Web and emerg- 
ing technologies such as cloud computing. BI 2.0 promises to enable better 
decision making by complementing traditional organizational data with infor- 
mation present in the Web, like opinions or information about competitors, 




VI 



Preface 



while using collective intelligence, collaborative work through social networks, 
and supporting BI systems with cloud computing. 

The sixth chapter surveys the issue of graph mining over social networks. 
In this respect, community detection may bring very valuable information 
about the structure of an existing social network. The chapter defines what 
is a community in a social network context, and then surveys the most popular 
techniques to detect such communities. 

The seventh chapter presents an overview of the Semantic Web, with a spe- 
cial focus on semantic databases, or triplestores, which are specific databases 
aimed at integrating, storing, and querying the vast amounts of data gener- 
ated on the Semantic Web every day. The chapter also provides an overview 
of Bl-related scenarios where semantic technologies and triplestores provide 
valuable advantage and differentiation. 

The eight chapter introduces the service paradigm and analyzes its 
impacts on BI. Specific techniques to engineering service systems are pre- 
sented, including cloud computing, service-oriented architectures (SO A), and 
business process modeling (BPM). The chapter also analyzes whether it is 
possible to consider BI as a service, and analyzes how to use BI techniques 
to enhance services. 

Finally, the ninth chapter explores collaborative BI, which aims at extend- 
ing the decision-making process beyond the company boundaries. Due to its 
inherent distribute nature, collaborative BI requires innovative approaches 
and architectures. The chapter surveys data warehouse integration as an 
enabling technique for collaborative BI and outlines a new peer-to-peer frame- 
work, called Business Intelligence Network, aiming at sharing business infor- 
mation for the decision-making process. 

The lectures of the summer school surveyed established areas in the BI do- 
main. They also also pointed out aspects that are beginning to be explored or 
are still waiting for a solution. We hope that the school’s material will inspire 
further exciting research in these areas. We are grateful to all the lecturers 
and their co-authors for their excellent contributions, the participants of the 
summer school for their enthusiasm, and the external referees for their careful 
work that helped us to improve the lectures in this volume. 
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Data Warehouses: Next Challenges 



Alejandro Vaisman and Esteban Zimanyi 

Department of Computer and Decision Engineering (CoDE) 
Universite Libre de Bruxelles 
{avaisman, ezimanyi}@ulb. ac .be 



Summary. Data Warehouses are a fundamental component of today’s Business 
Intelligence infrastructure. They allow to consolidate heterogeneous data from dis- 
tributed data stores and transform it into strategic indicators for decision making. 
In this tutorial we give an overview of current state of the art and point out to 
next challenges in the area. In particular, this includes to cope with more complex 
data, both in structure and semantics, and keeping up with the demands of new 
application domains such as Web, financial, manufacturing, genomic, biological, life 
science, multimedia, spatial, and spatiotemporal applications. We review consoli- 
dated resaerch in spatio-temporal databases, and open research fields, like real-time 
Business Intelligence and Semantic Web Data Warehousing and OLAP. 



Keywords: data warehouses, OLAP, spatiotemporal data warehouses, real- 
time data warehouses, semantic data warehouses. 

1.1 Introduction: Data Warehousing Past, Present, and 
New Applications 

OLAP (On-Line Analytical Processing) [IJ comprises a set of tools and algo- 
rithms that allow efficiently querying multidimensional databases containing 
large amounts of data, usually called Data Warehouses. In OLAP, data are 
organized as a set of dimensions and fact tables. In this multidimensional 
model, data can be perceived as a data cube , where each cell contains mea- 
sures of interest. OLAP dimensions are further organized in hierarchies that 
favor the data aggregation process 0. Several techniques have been devel- 
oped for query processing, most of them involving some kind of aggregate 
precomputation 0 . 

OLAP (or, more generally, Business Intelligence) (BI) software, produces 
reports and interactive interfaces that summarize data via basic aggregation 
functions (e.g., counts and averages) over various hierarchical breakdowns of 
the data into groups, defined in the dimension hierarchies. A lot of academic 
research and industrial development was carried out throughout the 1990’s 
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related to conceptual modeling, query processing and optimization, aggregate 
precomputation, etc. 



1.1.1 New BI Domains and Challenges 

Since the mid 90’s, DW and BI applications have been built to consolidate en- 
terprise business data, allowing taking timely and informed decisions based on 
up-to-date consolidated data. However, the availability of enormous amounts 
of data from different domains is calling for a shift in the way DW and BI 
practices are being carried out. It is becoming clear that, for certain kinds 
of BI applications, the traditional approach, where day-to-day business data 
produced in an organization is collected in a huge common repository for data 
analysis, needs to be revised, to account for efficiently handling large-scale 
data. Moreover, in the emerging domains where BI practices are gaining 
acceptance, massive-scale data sources are becoming common, posing new 
challenges to the DW research community. 

New database architectures are gaining momentum. Parallelism is becom- 
ing a must for large DW processing, as Stonebraker states 0J. Moreover, 
column stores databases are strong candidates for DW architectures, since 
they deliver much better performance than classic row databases, for fact ta- 
bles with a large number of attributes. The MapReduce model 0 is becoming 
increasingly popular, challenging traditional parallel DBMS architectures 0. 
Even if it is not clear if this approach can be applied to all kinds of DWs and 
BI applications (since it has been argued that MapReduce is not appropri- 
ate for analyzing structured data, given the typically high number of joins 
required), many large DW have been built based on this model. Moreover, 
there is a wave of opinion arguing that data warehousing can take advan- 
tage of the scalability of MapReduce (at the low price of adding commodity 
computer nodes) |Z . As an example, we can mention the (now famous) Face- 
book data warehouse built using Hadoop (an open source implementation 
of MapReduce) 0 0 . Along these same lines, Cohen et al. |CQ3] propose a 
paradigm shift in DW design, introducing the MAD Skills approach, aimed 
at allowing more flexibility when selecting the data sources, and more com- 
plex analysis functions than the typical OLAP operations. 

The above is not the only challenge for OLAP in the years to come. There 
is also a need to cope with complex data, both in structure and semantics, 
and keeping up with the demands of new application domains such as Web, 
financial, manufacturing, genomic, biological, life science, multimedia, spa- 
tial, and spatiotemporal applications. Most of these new kinds of data also 
require the vision shift expressed above, since large amounts of data populate 
the databases of applications in these domains. 

The intention of this document is two- fold: on the one hand, to review 
the state-of-the-art in the above mentioned domains. On the other hand, 
to discuss what is needed to be done to address new challenges in the BI 
world. For space limitations, and to make this discussion more concrete, in the 
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remainder we focus in three problems: spatio-temporal DW and BI (Section 
I I .21 . Real-Time DW ('Section 1 1 .31) . and DW on the Semantic Web (Section 
I I .4B . The rationale for this decision is that the first topic has been attracting 
the attention from research and industry since long ago, and is now quite 
mature and steadily producing relevant results, some of which we discuss in 
this work. Real-time DW is a clear challenge, in light of the large volumes of 
data handled by the organizations, although in our opinion, the field is open 
for research. Finally, the so-called Semantic Web, where billions of triples 
are stored daily, will be most likely the place where data and OLAP-style 
analysis will occur in the future. The analysis of web data (e.g., in RDF m 
format) embeds the problems described above: large amounts of data, volatile 
sources, and real-time analysis needs. 



1.2 Spatio-Temporal Data Warehousing and OLAP 

Geographic Information Systems (GIS) have been extensively used in various 
application domains, ranging from economical, ecological, and demographic 
analysis, to city and route planning Spatial information in a GIS is typi- 
cally stored in different so-called thematic layers (or themes). Information in 
themes consists of spatial data (i.e., geometric objects) associated to thematic 
(alphanumeric) information. 

Rivest et al. na introduced the notion of SOLAP (standing for Spatial 
OLAP), a paradigm aimed at exploring spatial data by drilling on maps, 
as it is performed in OLAP with tables and charts. They describe, in an 
informal way, the desirable features and operators a SOLAP system should 
have. In one of the first implementation efforts, Shekhar et al. m introduced 
MapCube, a visualization tool for spatial data cubes. Given a so-called base 
map, cartographic preferences, and an aggregation hierarchy, the MapCube 
operator produces an album of maps that can be navigated via roll-up and 
drill-down operations. Since then, SOLAP concepts and operators have been 
included in many commercial tools. Gomez et al. H3 introduces a SOLAP 
system denoted Piet. This proposal in based on a data model where GIS and 
warehouse data are maintained separately, and a matching function binds 
the two components. Piet comes equipped with an SQL-like query language, 
called Piet-QL. This language allows expressing GIS queries filtered by the 
values in the cells of a data cube (i.e., filtered by aggregated datajj, and 
OLAP queries filtered by conditions over spatial data 

1.2.1 A Taxonomy for Spatio-Temporal OLAP 

We will see in the remainder of this paper, that existing proposals for spatial 
data warehousing cover different functional requirements, but, with limited 

A Piet-QL demo can be found at http://piet.exp.dc.uba.ar/pietql 
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exceptions, there is no clear specification of the kinds of queries these propos- 
als address. The first attempt to provide a formal characterization of spatio- 
temporal data warehousing is the work by Vaisman and Zimanyi ESI. where 
a taxonomy of spatio-temporal OLAP queries is proposed (see Figure Oi- 
Generally speaking, the term SOLAP refers to the interaction between 
static GIS and static data warehouses (i.e., spatial objects and warehouse 
dimensions that do not change across time). When time gets into play, things 
become more involved. Different models exist for temporal data warehousing, 
depending on the approach followed to implement the warehouse pn rnq . in 
what follows, we denote a data warehouse temporal if it keeps track of the 
history of the instances of the warehouse dimensions, i.e., we assume there 
are no structural (schema) changes. 



Temporal 
Dimensions J 

TOLAP 



f Spatial 
TOLAP 



OLAP | f GIS | 


1 








g SOLAP g ^Spatio-Temporal) 






( Spatio-Temporal) 
OLAP J 



(Spatio-Temporal) 

TOLAP 



Fig. 1.1. A taxonomy for spatio-temporal data warehousing 



In the taxonomy defined in m the authors consider four basic classes: 
Temporal dimensions, OLAP, GIS, and moving data types. Adding moving 
data types to GIS produces Spatio-Temporal Data , typically allowing trajec- 
tory analysis in a geographic environment. Providing OLAP with the ability 
of handling temporal dimensions produces the concept of Temporal OLAP 
(TOLAP). The interaction of OLAP and GIS is denoted Spatial OLAP 
(SOLAP). The interaction between GIS and TOLAP is called Spatial TO- 
LAP (S-TOLAP). Adding OLAP capabilities to spatio-temporal data results 
in Spatio-Temporal OLAP (ST-OLAP). Finally, if the latter supports tem- 
poral dimensions we have Spatio-Temporal TOLAP (ST-TOLAP). 

We next discuss the kinds of queries in each one of the classes in the 
taxonomy of Figure I I . 1 1 The classification is based on the relational calcu- 
lus with aggregate functions m ■ Then, OLAP queries are defined as the 
ones expressible in such calculus. Let us call lZ agg these class of queries. By 
adding extending !Z agg incrementally with new data types, we can analyze 
the expressive power of each extension. We start by considering a set of non- 
temporal types (which allow expressing OLAP and SOLAP queries). These 
include a set of base types which are int, real, bool, and string, with the usual 
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interpretation, time types which are instant and periods, and the four spatial 
data types , point, points, line, and region. We have also Topological predicates 
- i.e., touches, which determines whether the boundaries of its two arguments 
have a nonempty intersection, or overlaps; set operations , like crossings, which 
returns the isolated points in the result of an intersection between two lines, 
whereas intersection returns common line parts. Aggregation operators re- 
duce point sets to points, and numeric operations , like no_components, which 
returns the number of disjoint maximal connected subsets (e.g., faces for 
a region, connected components for a line graph). There are other spatial 
functions, like area or distance. 

SOLAP queries should be able to compute aggregation of spatial and non- 
spatial measures over spatial and non-spatial dimensions. Therefore, spatial 
data types are needed. Then, the class of SOLAP queries is the class com- 
posed of all the queries that can be expressed by 7 Z agg augmented with spatial 
data types. 

The notion of Temporal OLAP (TOLAP) arises when evolution of the 
dimension instances in the data warehouse is supported, a problem also re- 
ferred to as slowly- changing dimensions p. This evolution is captured by 
using temporal types. In other words, when at least one of the dimensions 
in the data warehouse includes a time type, we say that the warehouse sup- 
ports the TOLAP model. Formally, the class of TOLAP queries is the class 
of queries expressed in lZ agg augmented with the time data types. 

Spatio-temporal OLAP (ST-OLAP) accounts for the case when the spatial 
objects evolve over time. For this, the authors consider moving types, denoted 
moving(a) where cc is a spatial data type, and the operations over these data 
types. For example, the projection of a moving point into the plane may 
consist of points and lines returned by the operations locations and trajectory, 
respectively. The projection of a moving line into the plane may consist of 
lines and regions, returned by the operations routes and traversed. Finally, 
the projection of a moving region into the plane consists in a region, which 
is also returned by the operation traversed. Other operations are defined for 
example, to compute the rate of change for points (e.g., speed, mdirection, 
etc.). In this way, the class of ST-OLAP queries is defined as the language 
queries expressed by TZ agg augmented with spatial types and moving spatial 
types. 

Spatial TOLAP (S-TOLAP) covers the case when in addition to having 
spatial objects and attributes in the data warehouse, the dimensions are also 
temporal. Therefore, Spatial TOLAP queries are the ones expressible in 7 Z agg 
augmented with time types, spatial types and moving types. 

Finally, Spatio-Temporal TOLAP (ST-TOLAP) is the most general case 
where there are moving geometries and the dimensions vary over time. That 
is, Spatio-Temporal TOLAP includes the class of queries expressible in 7Z agg 
augmented with time types, spatial types, moving spatial types, and moving 
types. 
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1.2.2 Conceptual Modeling in Spatio-Temporal OLAP 

Several conceptual models have been proposed for spatio-temporal data ware- 
houses. This section is based on the one introduced by Malinowski and 
Zimanyi m ■ The model, called MultiDim, considers a dimension level as 
spatial if it is represented as a spatial data type (e.g., point, region), and 
where spatial levels may be related through topological relationships (e.g., 
contains, overlaps). Spatial measures are characterized as measures repre- 
senting a geometry, which can be aggregated along the dimensions. 

We illustrate our discussion with the following example, where the En- 
vironmental Control Agency of a country has a collection of water stations 
measuring the value of polluting substances at regular time intervals. The ap- 
plication has maps describing rivers, water stations, and the political division 
of the country into provinces and districts. Figure U~^l shows the conceptual 
schema depicting the above scenario using the MultiDim model m- There is 
one fact relationship, WaterPollution, to which several dimensions are related. 
The fact relationship WaterPollution has two measures, commonArea and load, 
and is related to five dimensions: Time, Station, Pollutant, River, and District. 
Dimensions are composed of levels and hierarchies. For example, while the 
Station dimension has only one level, the District dimension is composed of 
two levels, District and Province, with a one-to-many parent-child relationship 
defined between them. 




Fig. 1.2. An example of a spatial data warehouse 



In the MultiDim model the spatiality of elements is indicated by pic- 
tograms. For example, Station, River, and District are spatial levels; they have 
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a geometry represented by a point, a line, and a region, respectively. Simi- 
larly, the attribute capital in Province, as well as the measures commonArea 
in the three fact relationships are spatial. Finally, topological relationships 
may be represented in fact relationships and in parent-child relationships. For 
example, the topological relationship in WaterPollution indicates that when- 
ever a water station, a river, and a district are related in an instance of 
the relationship, they must overlap. Similarly, the topological relationship in 
the hierarchy of dimension District indicates that a district is covered by its 
parent province. 

Vaisman and Zimanyi m extended the Multidim model using the data 
types defined by Giiting et al. m to the classical base types. They consider 
time types (instant and periods), and four spatial data types , point, points, line, 
and region. Another new data type Moving captures the evolution over time 
of base types and spatial types. Moving types are obtained by applying a 
constructor moving(-). Hence, a value of type moving(point) is a continuous 
function / : instant — > point. Moving types have associated operations that 
generalize those of the non-temporal types. This is called lifting. For ex- 
ample, a distance function with signature moving) point) x moving) point) — > 
moving) real) calculates the distance between two moving points and gives as 
result a moving real, i.e., a real-valued function of time. Intuitively, the se- 
mantics of such lifted operations is that the result is computed at each time 
instant using the non-lifted operation. 




Fig. 1.3. A temporal dimension Pollutant 



Temporal dimension levels are identified by the LS pictogram, as shown 
in Figure I I .31 (taken from my Now, the level Pollutant is temporal, which 
means that a new pollutant may start to be monitored from a particular 
date. Temporal levels have a predefined attribute called lifespan, of type 
moving(bool), which keeps track of the validity of a member at each instant. 
Temporal attributes are identified by the VT pictogram. In the example, the 
attribute loadLimit is temporal, meaning that the load limit varies across 
time. 

Spatio-temporal objects are represented in the conceptual model as shown 
in Figure fOI (also taken from (HI). The Cloud dimension refers to clouds 
generated by industrial plants. Both the Cloud level and the commonArea 
measure have a geometry that is a moving region, indicated by the symbol 
‘m’. The latter is called a derived measure, i.e., in an instance of the fact 
relationship that relates a cloud c, a district d 7 and a date t , the measure 
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keeps the restriction of the trajectory of the cloud at that date and over that 
district. 




Fig. 1.4. A fact relationship with a spatio-temporal measure and a spatio-temporal 
dimension 



1.2.3 Trajectory Data Warehousing 

Moving objects representation and computing have received a fair share of 
attention over recent years in the database community m- The behavior 
of all these moving objects is traceable by means of electronic devices. We 
may consider moving objects in two ways: object samples or trajectories. In 
order to appropriately analyze the latter, some form of interpolation may be 
necessary. If we assume that moving objects data are captured at a given 
time interval, with a certain granularity, the trajectory of a moving object 
(MO) is given by samples composed by a finite number of tuples of the 
form < Oid,t,x,y >, stating that at a certain point in time, namely t, the 
object Old was located at coordinates ( x,y ). Trajectory information can be 
collected and organized in Trajectory Data Warehouses (TDW) in order to be 
exploited through OLAP and data mining techniques. Applications involving 
MO analysis include traffic analysis, truck fleet behavior, commuter traffic in 
a city, passenger traffic in an airport, or shopping behavior in a mall. All of 
these applications involve aggregation of trajectory data. 

The notion of TDW was introduced by Orlando et al. m . This notion is 
aimed at providing the infrastructure needed to deliver advanced reporting 
capabilities and facilitating the use of mining algorithms on aggregated tra- 
jectory data. In short, a TDW allows analyzing measures of interest like the 
number of moving objects in different urban areas, average speed, or speed 
change. Over the TDW, data mining techniques can also be used to discover 
traffic-related patterns. An ETL (Extract-Transform-Load) procedure feeds 
a TDW with aggregate trajectory data, obtained from raw data consisting 
in the spatio-temporal positions of moving objects. Finally, a data cube is 
built from the TDW, aggregating measures for OLAP purposes. The trajec- 
tories to be analyzed present characteristics of different kinds: numeric (such 
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as the average speed, direction, duration); spatial (geometric shape of the 
trajectory), and temporal (timing of the movement). 

In order to support trajectory data, a spatio-temporal data cube should al- 
low analysis along (a) temporal dimensions; (b) spatial dimensions at different 
levels of granularity (point, cell, road); (c) thematic dimensions, containing, 
for instance, demographic data. In this sense, hierarchies must take into ac- 
count the fact that an element may rollup to more than one in an upper level. 
For instance, a road can probably cross more that one cell, yielding a relation 
instead of a function between a level cell to a level road. 

Typically, since the space is usually divided into cells, measures of interest 
in a TDW are, among other ones: (a) The number of trajectories found in the 
cell (or started/ended their path in the cell, or crossed/entered/left the cell); 
(b) The average (or minimum or maximum) distance covered by trajectories 
in a cell; (c) The average (or minimum or maximum) time required to cover 
the distance in (b); The speed and change of speed (acceleration), direction 
and change of direction (turn). Finally, a TDW algebra should support typical 
OLAP operators like roll-up, drill-down, and slice and dice. 

From a modeling point of view, a TDW is based on the classic star schema 
p. It contains a standard temporal dimension, and two spatial dimensions. 
The former ranges over equally sized time intervals, which are aggregated ac- 
cording to larger intervals as we move up in the dimension hierarchy (e.g., the 
interval [60,120] aggregates over the interval [0,120]). The spatial dimensions, 
denoted DimX and DimY, range over equally sized spatial intervals (over the 
axes x and y, respectively), defining the cells where measures are recorded. 
There is a fact table containing references to the dimensions and measures of 
the kinds commented above. Roll-up and drill-down are performed aggregat- 
ing measures over the cells at different granularities (for instance, combining 
two or more cells). 

Figure II .61 taken from Damiani et al. m depicts the TDW architecture. 
Initially, location data are captured, and handled by a so-called trajectory 
stream manager, which builds trajectories from these data (e.g., splitting the 
raw data according to some criteria), providing a trajectory identifier. This 
process is called trajectory reconstruction. Trajectories are stored in a rela- 
tional table, denoted RelTrajectories, and then loaded into a moving object 
database (MOD). Basically, the MOD includes a relation MODTrajectories 
with schema (Oid .trajectory id, trajectory), where trajectory is of a special 
type Moving Point. 

The ETL process is as follows: Initially, raw location data (usually arriving 
as a continuous data stream) is transformed into trajectory data. In other 
words, this step is aimed at determining the starting and ending points of 
a trajectory. The solution consists in splitting the bulk data according to 
certain assumptions. For example: (a) Temporal gap (maximum time gap 
between two points in the same trajectory); (b) Spatial gap (maximum spatial 
distance between two points); (c) Maximum speed (used to detect noise); (d) 
Maximum noise duration (if there is a long sequence of noisy observations, 
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a new trajectory is generated); (e) Tolerance distance D (if two observations 
are closer than a certain distance D, the latest one is considered redundant). 

The TDW can be exploited using OLAP techniques. The aggregated mea- 
sures allow us to obtain, for example, the variable number of moving ob- 
jects in different urban areas. Raffaeta et. al. m presented T- Warehouse, 
an implementation that allows analyzing trajectory data at different levels of 
aggregation. 

Marketos and Theodoridis m present an extension of the OLAP data 
model for TDW with the following features: (a) A flexible fact table able to 
answer queries considering different semantic definitions of trajectories; (b) A 
parameter that supports the choice of semantics for aggregation queries over 
trajectory data; (c) An ETL method loading raw location data in the flexible 
data cube; (d) OLAP techniques to support the different visions explained 
above. The proposal is denoted ad-hoc OLAP. 
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1.2.4 Next Steps: Handling Raster Data 

Advances in data analysis technologies raises new challenges for the BI and 
GIS research communities. One of them is the need to handle continuous 
fields , which describe physical phenomena that change continuously in time 
and/or space. Examples of such phenomena are temperature, pressure, and 
land elevation. Besides physical geography, continuous fields (from now on, 
fields), like land use and population density, are used in human geography 
as an aid in spatial decision-making process. Formally, a field is defined as 
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composed of !221 : ( a ) a domain T> which is a continuous set; (b) a range of 
values 1Z\ and (c) a mapping function / from V to 1Z. 

In his pioneering work on defining algebra for fields , Tomlin m proposed 
a so-called map algebra, based on the notion that a map is used to represent 
a continuous variable (e.g., temperature). There are three types of functions 
in Map algebra: local , focal , and zonal. Local functions compute a value at 
a certain location as a function of the value(s) at this location in other map 
layer(s). Focal functions compute each location’s value as a function of ex- 
isting values in the neighboring locations of existing layers (i.e., they are 
characterized by the topological predicate touches ). Zonal functions (charac- 
terized by the topological predicate inside), compute a location’s new value 
from one layer (containing the values for a variable), associated to the zone 
(in another map) containing the location. Camara et al. |22j and Cordeiro et 
al. m formalized and extended these functions, supporting more topological 
predicates. Mennis et al. EH! extended map algebra operators for querying 
time- varying fields. 

Although some work has been done to support querying fields in GIS, spa- 
tial multidimensional analysis of continuous data is still in its infancy. Exist- 
ing multidimensional models dealing with discrete data are not adequate for 
the analysis of continuous phenomena. Multidimensional models and associ- 
ated query languages are thus needed, to support continuous data. Vaisman 
and Zimanyi m presented a conceptual model for SOLAP that supports 
dimensions and measures representing continuous fields, and characterized 
multidimensional queries over fields. They defined a field data type, a set of 
associated operations, and a multidimensional calculus supporting this data 
type. Later on, Gomez at al. m proposed an implementation for the opera- 
tors included in the model. 

Figure If. (11 taken from m, shows how the conceptual model described 
above was extended to support fields, by adding the notions of field dimensions 
and field measures. There are pictograms for temporal and non-temporal field 
levels and measures. A field dimension is a dimension containing at least one 
level that is a field. In our example, the field dimensions are Elevation, SoilType, 
Temperature, and Precipitation where the latter two are temporal field dimen- 
sions. A field measure is a measure represented by a continuous field. For exam- 
ple, suitability is a field measure computed in terms of elements in the model, 
e.g., the suitability at a certain point can be a function of the kind of crop, 
temperature, precipitation, and elevation, at that point or its vicinity. Finally, 
a field hierarchy is a set of related field levels; it allows a field to be seen at 
different granularities. 

Notice that in this approach, field dimensions deserve particular treatment. 
In traditional multidimensional models, every dimension is connected to at 
least one fact relationship. The same approach has been followed in models 
introducing fields in spatial data warehouses (e.g., m), where dimension 
instances are values in the underlying domain (that may be obtained through 
on-the-fly interpolation). Due to the nature of continuous fields, there may be 
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Fig. 1.6. An example of a Spatial data warehouse with continuous fields 



an infinite number of instances, each one corresponding to one possible value 
of the domain. Vaisman and Zimanyi m follow a different approach: they 
define a field dimension containing only one instance (corresponding to the 
function), and the attributes of the field dimension correspond to metadata 
describing it, like the units at which the values are recorded (e.g., Celsius 
or Farenheit for temperature) . Consequently, field dimensions are part of the 
model, but are not tied to any particular fact table. 

To implement the model, the authors defined two new data types, de- 
noted field and tempfield, and their corresponding operations, again, along 
the lines of Giiting et al. m- Field types capture the variation in space of 
base types. They are obtained by applying a constructor field(-). Hence, a 
value of type field(real) (e.g., representing altitude) is a continuous function 
/ : point — > real. The new type has a set of associated operators. For exam- 
ple, the defspace operator receives a field, and returns the geometry defining 
it; rangevalues receives a field, and returns the set of values that the func- 
tion takes. Aggregation operators take a field as argument and produce a 
scalar value. Operations fmin and fmax give, respectively, the minimum and 
maximum value taken by the function. The Map Algebra operators are also 
supported. 

There is still much work to do in this topic. In real-world practice, scientists 
and practitioners register the values of a field taking samples at (generally) 
fixed locations, and inferring the values at other points in space using some 
interpolation method. The raster model is just one of the discrete data mod- 
els proposed to represent fields based on sampling and interpolation. The 
Map Algebras commented above, have been proposed to work over raster 
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representations, although Triangular Irregular Networks (TIN) and 

Voronoi diagrams m are becoming increasingly popular. Moreover, the argu- 
ments of the operators must be fields of the same kind. One research direction 
therefore can be to define a general algebra for spatio-temporal multidimen- 
sional data over continuous fields, independently of the underlying represen- 
tation of such data. These algebras should easily support new representations 
that may appear in the future. 



1.3 Real-Time Data Warehousing 

1.3.1 The Need for Real-Time Data Warehousing 

The typical method of updating data in a DW (or Data Mart) consists in 
implementing the well-known extraction, transformation and load (ETL) pro- 
cess (discussed above in the framework of trajectory DW). In a nutshell, ETL 
tools pull data from source systems periodically e.g., daily, weekly, monthly), 
providing a snapshot of the business data at a given moment in time. These 
batch data are then loaded into the data warehouse tables. During each cycle, 
the warehouse tables are refreshed and the process is repeated (in general, no 
matter whether the data has changed or not). Historically, this process has 
been considered acceptable, since it was next to impossible to get Real-Time 
(RT), continuous DW feeds from production systems. Moreover, it was diffi- 
cult to get consistent, reliable results from query analysis if warehouse data 
was constantly changing. Nowadays, however, DW users want current and 
up-to-date BI information. In addition, while in the early days only selected 
users accessed most DWs, in today’s web-based architectures large volumes of 
concurrent requests must be handled maintaining consistent query response 
times, and must scale seamlessly as the data volume and number of users 
grows continuously. Moreover, DW need to remain available 7 x 24. To be 
more technical, let us comment on the lifecycle of a data record in a BI envi- 
ronment. The cycle starts with a business event taking place. ETL routines 
then deliver the event record to the DW. Finally, analytical processing turns 
the data into information, to help the decision-making process, and a business 
decision leads to a corresponding action. To approach Real Time, the time 
elapsed between the event and its consequent action (called the data latency ) 
needs to be minimized. In the general case, it is the data acquisition process 
that introduces majority of the data latency. Therefore, to support real-time 
BI, real-time DW are needed (RTDW) fT7l . Examples of these needs were dis- 
cussed by Schneider ESI, and include for instance: (a) Collaborative filtering, 
e.g., with queries such as “People who like X also like Y”. Here, the timeli- 
ness (freshness of data) is in the range of hours; (b) Fraud Detection. Detects 
anomalies in credit card usage. Timeliness here is in the order of minutes; 
(c) Call Center applications, e.g., to provide next best offer or action. In this 
applications, timeliness is again, minutes, (d) Web Page Usage analysis, e.g., 
by property, geography, user demographics, referrer, etc. Timeliness in this 
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case in in the range of hours or one day at most, (e) Business Activity Mon- 
itoring and Operational Performance Management (e.g., real-time inventory 
analysis). Timeliness would be in the order of minutes. 

Making rapid decisions based on large volumes of data requires achieving 
low data latency, sometimes at the expense of potential data inconsistency 
(e.g., late, missing data) and very high availability requirements, and spe- 
cialized hardware. Further, integration with other sources can be a challenge. 
We next discuss how the DW community has addressed these challenges. 
However, the reader must be aware of the fact that, in spite that many appli- 
cations require very low latency access, on the other hand, most applications 
do still not require these latency levels (that may come down to the seconds 
granularity). For example, some applications like targeting, alerting, recom- 
mendations, may demand for no so fresh data. In these cases, the common 
evolution strategy is to increase frequency of ETL operations using mini- 
batch ETL, e.g., loading data every 10 minutes. 

1.3.2 Strategies for Enabling Real Time ETL 

Several alternatives have been devised to achieve RT ETL in order to reduce 
data latency. The simplest one, which requires the least effort in terms of 
changes to existing architectures, is the one called Near-Real Time ETL. 
Near RT ETL tools simply increase the frequency of ETL without changing 
the load process, data models and reporting applications. Most work in the 
field follows this approach (see e.g. (SSI). However, this is not enough when 
data latency must be drastically reduced, and other alternative paths must 
be followed. These paths take advantage of the fact that data warehouses 
become more ‘operational’ in a RT data scenario, and transformations occur 
at the data warehouse, thus reducing data and analysis latency, eliminating 
the need to aggregate updated data on a centralized server until it is batch- 
processed. The former led to alternatives for approaches like the Direct Trickle 
Feed (DTF). Here, new data from the operational sources are continuously 
fed into the data warehouse This is done by either directly inserting data in 
the fact tables, or by inserting data into separate fact tables in a dedicated 
real-time partition. A variant of this strategy, which addresses the mixed 
workload problem introduced by DTF, is the one called Trickle and Flip. 
In this approach, instead of loading the data in real-time into the actual 
warehouse tables, data are continuously fed into staging tables that are an 
exact copy of the target tables. Periodically, feeding is stopped, and the copy 
is swapped with the fact table, bringing the DW up-to-date. Although cycle 
times range from hours to minutes, the best reported performances range 
between five and ten minutes. Finally, the Real-Time Data Caching (RTDC) 
strategy EH, although costly, completely avoids mixed workload problems: 
a RTDC consists in a dedicated database server (or a separate instance of 
a large database system) dedicated to loading, storing, and processing the 
RT data. In-Memory databases could be used for RTDC for large volumes of 
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RT data (in the order of hundreds or thousands of changes per second), or 
extremely fast query performance requirements Here, the RT data is loaded 
into the cache as it arrives from the source system. A drawback of this strategy 
is that, since the RT and historical data are separately stored, when a query 
involves both kinds of data, its evaluation could be costly. 



Modeling RT Fact Tables 

Physical modeling help the strategies above in reducing data latency. A clas- 
sic solution consists in define Real-Time Fact Table partitions, where RT 
and historical data are stored in separate fact tables m This partition is 
physically and administratively separated from the Data Warehouse tables, 
and it is subject to special update and query rules. The RT partition must 
resemble as much as possible the static data warehouse fact tables, and must 
contain all the activity that has occurred since the last update of the static 
data warehouse. Query tools should be able to distinguish both kinds of ta- 
bles, and know where to find data. That means, the BI tools must be smart 
enough to formulate a query to drill across the static fact tables, and include 
the up-to-the-second records from the real-time partition. This is not always 
achieved by commercial BI tools, however. 

An alternative to this solution consists in modeling with an external Real- 
Time data cache. In this case, no new modeling is required, and the RDTC 
has the same structure as historical data. 



1.3.3 ETL or CFT? 

In a scenario where data refreshment tools must provide real-time access, the 
larger the data warehouse, the longer it takes to refresh with the traditional 
ETL procedure. More than often, the volume of data being loaded into the 
warehouse exceeds the allocated batch updating window. Then, in an environ- 
ment where 20 per cent of operational data change every week or month, some 
practitioners and vendors propose a Capture, Transform and Flow (CTF) so- 
lution that captures, transforms and flows data in real-time into an efficient, 
continuously refreshed data warehouse m ■ We discuss this next. 



Data Capture 

With advanced CTF solutions, every time an add, change or delete occurs 
in the production environment, it is automatically captured and integrated 
or pushed in real-time to the data warehouse. Beyond real-time integration, 
change data capture can also be done periodically. Data can be captured and 
then stored until a predetermined integration time. For example, an orga- 
nization may schedule its refreshes of full tables or changes to tables to be 
integrated hourly or nightly. Only data that has changed since the previ- 
ous integration needs to be transformed and transported to the subscriber. 
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The data warehouse can therefore be kept current and consistent with the 
source databases. Figure I I .71 depicts the procedure followed by the Oracle 
Data Integration TooJ|. First, an identified subscriber (e.g., an integration 
process) subscribes to changes that might occur in a datastore. The Changed 
Data Capture framework then captures changes in the datastore and pub- 
lishes them for the subscriber. Finally, the subscriber can process the tracked 
changes at any time and consume these events. Once consumed, events are 
no longer available for this subscriber. 



Data Transformation 

Transformational data integration software can conduct individual tasks such 
as translating values, deriving new calculated fields, joining tables at source, 
converting date fields, and reformatting field sizes, table names and data 
types. All of these functions allow for code conversion, removal of ambiguity 
and confusion associated with data, standardization, measurement conver- 
sions, and consolidating dissimilar data structures for data consistency. 



Data Flow 

This refers to refreshing the feed of transformed data in real-time from multi- 
ple operational systems to one or more subscriber systems. The flow process 
is a continuous stream of information as opposed to the batch loading of data 
performed by ETL tools. 
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Fig. 1.7. Change Data Capture 



1.3.4 Right Time Data Warehousing 

Thomsen et al. m proposed the RiTE architecture for what they denoted 
Right Time DW (see Figure EED. Here, a so-called RiTE ‘Catalyst’ module 
is added to the classical ETL architecture. The rationale is based on the fact 

http : //www. oracle . com 
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that some data must be fresh while other data can support higher latency, 
that is, parts of the data must be loaded quickly after arrival, while other 
parts can be loaded at regular intervals In other words, the data is loaded at 
the right time , therefore the approach is called “right-time data warehousing” , 
opposite to “near-real time DW” , were data is loaded into the DW minutes or 
seconds after it arrives. In both approaches, regular SQL INSERT statements 
are used, leading to slow insert speed. An appropriate solution should find 
the correct batch size between the two extremes (bulk load vs. single-row 
INSERT). The RiTE architecture includes: 

• A specialized JDBC database driver for the producer 

• A specialized JDBC database driver for the consumers 

• A main-memory based catalyst which provides intermediate storage 
(memory tables) for (user-chosen) DW tables, offers fast insertions and 
concurrency control. Besides, data can be queried while held by memory 
tables, transparently to the end user. Eventually the data is moved to its 
final target the physical DW tables 

• A PostgreSQLtable function makes the data available in the DW 

The data producer uses a specialized database driver which handles IN- 
SERT and COMMIT operations in a particular way. Other operations are 
executed traditionally. The RiTE prototype treats prepared statements in- 
serting scalars into memory tables, and can handle typical DW fact tables 
The values are not inserted directly into the DW, but instead kept in a lo- 
cal buffer. Later the data is flushed and reaches the memory table Finally, 
the data is materialized and reaches the DW tables. The default is to flush 
immediately after the commit although another policy waits and temporar- 
ily places the data at a file on the producer side (this is called the “lazy 
commit” ) . 




Fig. 1.8. The RiTE Architecture 
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1.3.5 Future Steps 

In spite of the work commented here, we believe that there is a lot of room 
for fruitful research here. Most of the work has been done in the ETL process, 
mostly at the physical level. However, less work has analyzed whether or not 
the current data models are still appropriate when data latency is a hard 
requirement. Therefore, comprehensive models and frameworks are needed 
to allow RTDW on solid theoretical basis. Re-reading the needs stated in 
M, we can see that this area has evolved, from the research and scientific 
point of view, more slowly than other areas (like Spatial OLAP and BI), were 
solid scientific results were achieved. 

1.4 Semantic Web Data Warehouses 

The Semantic Web (SW) is a proposal oriented to represent Web content in 
an easily machine-processable way. The basic layer of the data representation 
for the Semantic Web recommended by the World Wide Web Consortium 
(W3C) is the Resource Description Framework (RDF) [I I j . The Ontology 
Web Language (owlJJ is a language for the specification of ontologies, whose 
definition by the W3C Consortium has encouraged different communities to 
develop large and complex ontologies like the NCI thesaurus, GALEN, etc. 
OWL provides a powerful knowledge representation language that has a clean 
and well defined semantics based on Description Logics (DL). In a Semantic 
Web scenario, domain ontologies (defined in RDF or some variant of OWL) 
are aimed at defining a common terminology for the concepts involved in a 
particular domain. Semantic annotations are especially useful for describing 
unstructured, semi-structured and text data. Many applications attach meta- 
data and semantic annotations to the information they produce (for example, 
in medical applications, medical image, laboratory tests). In the near future, 
large repositories of semantically annotated data will be available, opening 
new opportunities for enhancing current decision support systems. Perez et 
al. [U3 provide a good survey of the efforts made for integrating web data 
analysis into OLAP tools and techniques, although oriented in a different 
direction than the present discussion: the authors focus on XML data ware- 
houses and tools, while we are aimed at discussing pure semantic web data 
analysis. Two lines of work are clearly identified. One focuses in multidimen- 
sional design automation taking advantage of existing ontologies. The other 
one studies methods aimed at analyzing the large amounts of semantic web 
data using OLAP tools. We believe the latter is a very promising line of work. 

1.4.1 DW Design Automation from Ontologies 

There is a large corpus of work aimed at using ontologies for automatic DW 
design. As an example, Niinimaki and Niemi m use semantic web technolo- 
gies to populate OLAP cubes. They use ontology mapping to convert data 

3 http : //www. w3 . org/TR/owl2- overview/ 
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sources to RDF and then query this RDF data with SPARQL3 to populate 
the OLAP schema. The method uses an ontology and mapping files that con- 
nect the data sources with the ontology. The ETL process is guided by the 
ontology. The authors create an OLAP ontology (e.g., defining that there are 
dimensions and measures, and that the measure is a function of the dimen- 
sions). From this ontology they build application specific sub-ontologies. Data 
are then formatted to conform to these ontologies. Ontologies are expressed 
in RDF and OWL. Along the same lines, Romero and Abello m address the 
design of the data warehouse starting from an OWL ontology that describes 
the data sources. They identify the dimensions that characterize a central 
concept under analysis (the fact concept) by looking for concepts connected 
to it through one-to-many relationships. The same idea is used for discover- 
ing the different levels of the dimension hierarchies, starting from the concept 
that represents the base level. In this work the input ontology indicates the 
multiplicity of each role in the relationships; and a matrix keeps, for each 
concept, all the concepts that are related by means of a series of one-to-many 
relationships. The output of the Romero and Abello’s method is a star or 
snowflake schema that guaranties the summarizability of the data, suitable 
to be instantiated in a traditional multidimensional database. The applica- 
tion of this work is valid in scenarios where a single ontology of reduced 
size, with multiplicity restrictions, is used for annotating the source data. 
Note that multiplicity information is rarely found in the source ontologies. 
Schematically, the method is composed of three clearly defined steps, each 
one aimed at discovering an specific multidimensional concept automatically. 
At the end of each step the method asks for the end-user multidimensional 
requirements (the only non-automatic part of the process). The first task 
looks for potential facts. Therefore, concepts with most potential dimensions 
and measures are good candidates. At the end of this task, the user is re- 
quired to choose the subjects of interest among the concepts proposed as 
potential Facts. The second task points out sets of concepts likely to be used 
as a base for each identified fact. A base is defined as a collection of concepts 
labeled as potential dimensions. That is, the method now looks for concepts 
being able to univocally identify objects of analysis (i.e., factual data). The 
third task produces dimension hierarchies. For every concept identified as a 
dimension, a hierarchy is built from those concepts related to each other by 
typical whole-part relationships. 

1.4.2 Analyzing Ontology Instances 

Following the second line of research that we mentioned above, Nebot el al. 
m proposed a semi-automatic method for on-demand extracting semantic 
data into a multidimensional database. In this way, data could be analyzed 
using traditional OLAP techniques. In other words, an ETL process for Se- 
mantic Web data sources. Figure E3 taken from the mentioned work, shows 
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a very general sketch of how ontology data are transformed into multidimen- 
sional data. The authors present a method for identifying, transforming and 
combining semantic data in order to discover facts and populate a MD model 
with such facts. Figure IT. 1 1 II depicts the general architecture of the proposal. 
The authors assume that data is represented as an OWL ontology. Here, data 
are separated into the TBox (the ‘schema’), and the ABox (the ‘instance’). 
The former is usually called the ‘ontology’, while the latter are the “semantic 
annotations’, represented as triples where the subject is always an instance, 
an object can be either an instance or a literal, and the predicate can be 
either an object property or a data type property. Figure II . 1(11 shows the on- 
tology and the annotations, from which data to populate the fact table are 
indicated. There are three main phases in the methodology: (1) Designing 
a MD schema. The user selects the subject of analysis, which corresponds 
to a concept of the ontology. Then, she selects potential dimensions, either 
a named concept or data type property. Finally, she defines the measures, 
which are functions over data type properties. (2) The fact extractor identi- 
fies and extracts facts from the instance store according to the MD schema 
previously designed, producing the base fact table of a data warehouse. (3) 
The user specifies MD queries over the data warehouse. The MD query is 
executed and a cube is built. Then, typical OLAP operations can be applied 
over the cube. 



Semantic Data MD Fact Table 

OWL/RDF(S) stores 




Fig. 1.9. From Ontologies to Multidimensional Data 



1.4.3 Scaling OLAP for the Semantic Web Analysis 

We have mentioned previous efforts on analyzing semantic web data using 
OLAP tools. In a semantic web context, where data are represented as an 
RDF labeled graph (i.e. , binary relations), analytical queries consist of three 
main constructs namely graph pattern matching, grouping and aggregation. 
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Fig. 1.10. Architecture 



Therefore, join operations are needed to transform data into n-ary relations 
relevant to the given query. On the contrary, in traditional OLAP, data are 
organized in a structured fashion. Processing semantic web data implies that 
powerful processing mechanisms are needed, given that large joins must be 
computed. Thus, parallel processing systems are needed. We commented in 
Section o that implementations of Google’s MapReduce m are gaining 
success in processing large analytical workloads, although the model is not 
appropriate to compute costly joins. 

Remark 1. As a solution to the problem of join computing, a new hybrid ap- 
proach, HadoopDB m, is becoming increasingly popular. It combines the 
performance and efficiency advantages of parallel databases with the scala- 
bility and fault-tolerance advantages of MapReduce systems. Note however 
that HadoopDB partitions the data as required by the query and pushes as 
many operations as possible to the database, which reduces the improvement 
to the processing of the first reduce task, which usually marks the end of the 
first join operation. This is not enough for RDF processing. Therefore, Afrati 
and Ullman m optimize multi-way joins by providing strategies to efficiently 
partition and replication of the tuples of a relation on reducer processes, min- 
imizing the communication cost. Other query optimization techniques used 
in RDF query processing, consist in materializing the transitive closure of 
the RDF graph. □ 

In the MapReduce programming model, tasks are encoded using two indepen- 
dent functions: Map and Reduce, such that their execution can be parallelized 
on a cluster of machines. The Map function reads each line in the data and 
transforms it into a (key, value) pair. After all mapping tasks are completed, 
the intermediate keys are sorted and merged. The objects with the same 
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key i.e. (key, list(values )) are collected by an specific reducer. The Reduce 
function focuses on performing aggregation operations on this grouped data. 
Sometimes a combiner function may be used to perform partial aggregations 
in the map phase. However, this model was devised with a single large data 
file in mind, not in a Semantic Web scenario, where matching graph patterns 
results in a large number of joins. Graph pattern matching is the typical 
query evaluation mechanism of SPARQL, the standard query language for 
the Semantic Wetfl 

To enhance parallelism of analytical processing on RDF graphs, Sridhar 
et al. 152, proposed a dataflow language, denoted RAPID, which extends Ya- 
hoo’s Pig Latin language with query primitives for dealing with the graph 
structured nature of RDF. They also propose a technique to express complex 
analytical queries in a way that allows optimization and enhances paral- 
lelization on MapReduce platforms. This technique integrates into Pig Latin 
primitives for implementing the MD-join operator |23| ■ As an example, Fig- 
ure taken from m shows a very simple fact table, expressed as an RDF 
graph, that can lead to difficulties to evaluate aggregate queries like “Number 
of sales, for each product and month of 2000, such that sales were between 
the previous and following months’ average sales”. 
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Fig. 1.11. A simple fact table in RDF 



Ravindra et al. gi extend the previously commented work developing 
User Defined Functions (UDFs) , through basically two mechanisms: function 

4 http : //www. w3 . org/TR/rdf-sparql-query/ 
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coalescing (i.e., merging MapReduce functions into a single one, to reduce 
function calls and I/O) and look-ahead processing. These UDFs have been 
integrated into the Pig Latin function library and the experimental results, 
according to the authors, show up to a 50% improvement in execution times 
for certain classes of queries. 

1.5 Conclusion 

We have discussed new directions in DW and BI, addressing quite consol- 
idated fields, like spatio-temporal BI, topics that in spite of having been 
detected as critical (like real-time DW) have not produced relevant results so 
far, and topics that we expect will receive attention in the near future, like 
Semantic Web DW. From our discussion, it follows that future research will 
focus on scalability and performance, given the increasing amount of available 
and heterogeneous data. More formal work on Real-Time DW, particularly 
in modeling is needed, as well as models and methodologies for DW and BI 
on the Semantic Web. In the latter case, proposals so far are ad-hoc and rely- 
ing on transforming RDF graphs and ontologies into standard Star-schemas. 
Finally, new domains for BI are slowly emerging, like image and multimedia 
OLAP [23, and will also require new models and algorithms to be developed. 
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Summary. Data stored in a data warehouse (DW) are retrieved and analyzed by 
complex analytical applications, often expressed by means of star queries. Such 
queries often scan huge volumes of data and are computationally complex. For 
this reason, an acceptable (or good) DW performance is one of the important 
features that must be guaranteed for DW users. Good DW performance can be 
achieved in multiple components of a DW architecture, starting from hardware (e.g., 
parallel processing on multiple nodes, fast disks, huge main memory, fast multi-core 
processor), through physical storage schemes (e.g., row storage, column storage, 
multidimensional store, data and index compression algorithms), state of the art 
techniques of query optimization (e.g., cost models and size estimation techniques, 
parallel query optimization and execution, join algorithms), and additional data 
structures improving data searching efficiency (e.g., indexes, materialized views, 
clusters, partitions). In this chapter we aim at presenting only a narrow aspect of 
the aforementioned technologies. We discuss three types of data structures, namely 
indexes (bitmap, join, and bitmap join), materialized views, and partitioned tables. 
We show how they are being applied in the process of executing star queries in three 
commercial database/data warehouse management systems, i.e., Oracle, DB2, and 
SQL Server. 

Keywords: data warehouse, star query, join index, bitmap index, bitmap join 
index, materialized view, query rewriting, data partitioning, Oracle, SQL Server, 
DB2. 

2.1 Introduction 

A data warehouse architecture has been developed in order to integrate 
and analyze data coming from multiple distributed, heterogeneous, and au- 
tonomous data sources (DSs), deployed throughout an enterprise. A core com- 
ponent of this architecture is a database, called a data warehouse (DW), that 
stores current and historical data, integrated from multiple DSs. The content 
of a DW is analyzed by various On-Line Analytical Processing (OLAP) ap- 
plications for the purpose of discovering trends (e.g., demand and sales of 
products), discovering patterns of behavior (e.g., customer habits, credit re- 
payment history) and anomalies (e.g., credit card usage) as well as for finding 
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dependencies between data (e.g., market basket analysis, suggested buying, 
insurance fee assessment). OLAP applications execute complex queries, some 
of them being predefined (e.g. reports), whereas others being executed ad-hoc 
by decision makers. 

The queries are expressed with multiple join, filtering, aggregate, and sort 
operations and they process large (or extremely large) volumes of data. A 
special class of analytical queries includes star queries that join a central 
table with multiple referenced tables. The execution of analytical queries 
may take hours or even days. For this reason, providing means for increasing 
the performance of a data warehouse for analytical queries and other types 
of data processing (e.g., data mining) is one of the important research and 
technological areas. 

2.1.1 Increasing DW Performance: Research and Technological 
Advances 

A DW performance depends on multiple components of a DW architecture. 
The basic components include: (1) hardware on which a database/data ware- 
house management system (DB/DWMS) is installed and computational ar- 
chitectures, (2) physical storage schemes of data, (3) robustness of a query 
optimizer, (4) additional data structures supporting faster data access. 

Hardware Architectures 

Hardware and various processing architectures, e.g., shared memory, shared 
disk, shared nothing, have a substantial impact on the performance of a DW. 
Multiple nodes with their processing power allow to process data in parallel. 
Modern cluster, grid, and cloud architectures take advantage of parallel data 
access and processing. A lot of research efforts focus on this area, e.g., 

H 0. Recent research and technological trends concentrate also on using 
parallel processing of powerful graphic processing units, e.g,. 01001313 for 
processing data. Another hot topic research and technological issue concerns 
main memory databases and data warehouses, e.g., p rrninrj . 



Physical Storage 

In practice, a DW is implemented either in a relational server (the ROLAP 
implementation) or in a multidimensional server (the MOLAP implementa- 
tion). The ROLAP implementation can be based either on a row storage 
(RS) (a typical one) or on a column storage (CS) (current trend) . RS is more 
suitable for transactional processing (inserts, deletes, updates) . CS offers bet- 
ter performance for applications that read and compute aggregates based on 
the subset of table columns, thus it is better suited for OLAP processing. 
An intensive research is conducted in the area of column store DWMSs, e.g., 
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P I H3 | IHj in order to develop efficient join algorithms, query materializa- 
tion techniques, index data structures, and compression techniques H3 , to 
mention some of them. These and many other efforts resulted in commer- 
cially available and open source column storage DWMS, e.g., Sybase IQ, 
EMC Greenplum, C-Store/Vertica, MonetDB, Sadas, FastBit, Model 204. 
The MOLAP implementation is based on other storage structures, like multi- 
dimensional arrays and multidimensional indexes. In this technology research 
is focused among others on developing efficient storage implementations, in- 
dex structures, and compression techniques, e.g., mm 

Query Optimizer 

The way queries are executed strongly impacts the performance of a DW. 
The process of optimizing query executions and building robust query op- 
timizers has been receiving substantial focus from the research community. 
Huge research literature exists on this issue, cf., [13- Recently, research in 
this area concentrates among others on join algorithms, e.g., □a, parallel 
query optimization and execution, e.g., 1213 EH, designing robust and more 
intelligent query optimizers 

Querying 

Typically, OLAP applications analyze all data in order to deliver reliable 
results. Nonetheless, if volumes of data are extremely large, some researchers 
propose to apply various techniques to computing approximate results, like 
for example (1) sampling, based on histograms or wavelets, e.g., J22GI!, (2) 
statistical properties of data, or (3) apply a probability density function, e.g., 

m 



Data Structures 

Query execution plans profit from additional data structures that make data 
searching faster and reduce the volume of data that has to be retrieved. 
Different data structures have been developed and applied in practice in 
commercial DWMSs. Some of them include: (1) various types of indexes, like 
join indexes, e.g., ESI, bitmap indexes, e.g., E2E£I, and bitmap join indexes, 
e.g., E3 EDI, (2) materialized views and query rewriting techniques, e.g., 
E3, (3) table partitioning, e.g., E21E3lE3j. In these areas multiple research 
works focus on compressing bitmap indexes, e.g., algorithms 

for materialized view selection and fast refreshing, e.g., IEB1 E3 E3 EE, and 
finding the most efficient partitioning schemes, e.g., El El- 

Testing Performance 

An important practical issue concerns the ability of evaluating the perfor- 
mance of a DWMS and compare multiple architectures with this respect. 
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To this end, multiple approaches to testing the performance and robustness 
of a DWMS have been developed, e.g., J3, S, Eh Et an d are being further 
intensively investigated. 

2.1.2 Chapter Focus 

In this chapter we overview a narrow area only of the huge research and 
technological area devoted to increasing a data warehouse performance. We 
focus on the aforementioned data structures (indexes, materialized views, 
and partitions) in ROLAP servers and illustrate their basic functionality and 
usage in commercial DWMSs, i.e. , Oracle, DB2, and SQL Server. 

The chapter is organized as follows. In Section FOl we present basic data 
warehouse concepts and an example data warehouse used throughout this 
chapter. In Section HP1 we present index data structures applied to the opti- 
mization of star queries. In Section 12.41 we discuss a technique of materializ- 
ing query results and using the materialized results for query optimization. 
In Section 12. hi we present table partitioning techniques. Finally, Section 12. (il 
includes the chapter summary. 

2.2 Basic Concepts 

2.2.1 DW Model and Schema 

In order to support various analyses, data stored in a DW are represented 
in a multidimensional data model jI71 iEj. In this model an elementary in- 
formation being the subject of analysis is called a fact. It contains numerical 
features, called measures that quantify the fact. Values of measures are an- 
alyzed in the context of dimensions. Dimensions often have a hierarchical 
structure composed of levels, such that Li — > Lj, where — > denotes hierarchi- 
cal assignment between a lower level Li and upper level Lj, also known as a 
roll-up or an aggregation path m ■ Following the aggregation path, data can 
be aggregated along a dimension hierarchy. 

The multidimensional model is often implemented in relational OLAP 
servers (ROLAP) ( HTH . where fact data are stored in a fact table, and level 
data are stored in dimension tables. In a ROLAP implementation two ba- 
sic types of conceptual schemas are used, i.e. a star schema and a snowflake 
schema m- In the star schema each dimension is composed of only one 
(typically denormalized) level table. In the snowflake schema each dimension 
is composed of multiple normalized level tables connected by foreign key - 
primary key relationships. 

An example star schema that will be used throughout this chapter is shown 
in Figure 12 . 1 1 It is composed of the Sales fact table, and three dimension 
tables, namely Products, Customers, and Time. The Sales fact table is con- 
nected with its dimension table via three foreign keys, namely ProductID, 
CustomerlD, and TimeKey. The fact table includes measure column Sale- 
sPrice. 
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dimension Customer 




Fig. 2.1. Example data warehouse star schema on sales of products 



2.2.2 Star Queries 

Based on a DW schema, analytical queries are executed in a DW. As men- 
tioned earlier, such queries typically join multiple tables, filter and sort data, 
as well as aggregate data at different levels of dimension hierarchies. Typ- 
ically, the queries join a fact table with multiple dimension tables and are 
called star queries. 

An example star query computing the yearly sales (in years 2009 and 2010) 
of products belonging to category ’electronic’ in countries of sales, is shown 
below. 

select sum(SalesPrice) , ProdName, Country, Year 

from Sales s, Products p, Customers c, Time t 

where s . Product ID=p . Product ID 

and s . CustomerID=c . CustomerlD 

and s .TimeKey=t .TimeKey 

and p. Category in (’electronic’) 

and t.Year in (2009, 2010) 

group by ProdName, Country, Year; 

2.3 Index Data Structures 

Star queries can profit from applying some indexes in the process of retrieving 
data. Three indexes, the most frequently used in practice include: a join index, 
a bitmap index, and a bitmap join index, which are outlined in this section. 

2.3.1 Join Index 

A join index represent the materialized join of two tables, say R and S. As 
defined in ED 122, a join index is a table composed of two attributes. It stores 
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the set of pairs ( r* , Sj) where r,; and Sj denote tuple identifiers from R and 
S , respectively, that join on a given predicate. In order to make searching 
the join index faster, it is assumed that the join index is physically ordered 
(clustered) by one of the attributes. Alternatively, the access to the join index 
can be organized by means of a B-tree or a hash index m- 

A join index can be created either on a join attribute or on a non-join 
attribute of a table. In the second case, the index is organized in such a 
way that it allows lookups by the value of the non-joined attribute in order 
to retrieve the ROWIDs of rows from the joined tables that join with the 
non-joined attribute value. 

In the context of a data warehouse, a join index is applied to joining a di- 
mension table and a fact table. The index is created either on a join attribute 
of a dimension table or on another attribute (typically storing unique values) 
of a dimension table. In order to illustrate the idea behind the join index let 
us consider the example below. 

Example 1. Let us consider tables Products and Sales from the DW schema 
shown in Figure 12 . 1 1 Their content is shown in Table rm For clarity pur- 
pose, both tables include also explicit column ROWID that stores physical 
addresses of records and that serve as row identifiers. 



Table 2.1. Example tables in the star schema on sales of products (from Fip T2. 1 1 



Sales 


Products 


ROWID 


SalesPrice 


Discount 


ProductID 


ROWID 


ProductID 


ProdName 


Category 


OAAO 




5 


100 


BFF1 


100 


HP Pavilion 


electronic 


0AA1 




15 


230 


BFF2 


230 


Dell Inspiron 


electronic 


0AA2 




5 


100 


BFF3 


300 


Acer Ferrari 


electronic 


0AA3 




10 


300 




0AA4 




10 


300 


0AA5 




15 


230 



The join index defined on column ProductID is shown in Table 12.21 



Table 2.2. Example join index on ProductID 



Products. ROWID 


Sales. ROWID 


BFF1 


OAAO 


BFF1 


0AA2 


BFF2 


0AA1 


BFF2 


0AA5 


BFF3 


0AA3 


BFF3 


0AA4 
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As one can observe from the above example, the join index stores a ma- 
terialized (precomputed) join of the Products and Sales tables. Thus, it will 
optimize queries like: 

select . . . 

from Sales s, Products p 

where s . ProductID=p . ProductID ... 



2.3.2 Bitmap Index 

OLAP queries not only join data, but also filter data by means of query 
predicates. Efficient filtering of large data volumes is well supported by the 
so-called bitmap indexes Eliza I2HI E| Conceptually, a bitmap index created 
on attribute a m of table T is organized as the collection of bitmaps. For each 
value vali in the domain of a m a separate bitmap is created. A bitmap is a 
vector of bits, where the number of bits equals to the number of records in 
table T. The values of bits in bitmap for vali are set as follows. The n-th 
bit is set to 1 if the value of attribute a m for the n-th record equals to vali. 
Otherwise the bit is set to 0. 

In order to illustrate the idea behind the bitmap index let us consider the 
example below. 

Example 2. Let us review the Sales fact table, shown in Table 12.31 The ta- 
ble contains attribute Discount whose domain includes three values, namely 
5, 10 , and 15, denoting a percent value of discount. A bitmap index cre- 
ated on this attribute will be composed of three bitmaps, noted as Bm5perc, 
BmlOperc, and Bml5perc, respectively, as shown in Table FOI 

The first bit in bitmap Bml5perc equals to 0 since the Discount value of 
the first record in table Sales does not equal 15. The second bit in bitmap 
Bml5perc equals to 1 since the Discount value of the second record in table 
Sales equals to 15, etc. 

Such bitmap index will offer a good response time for a query selecting 
for example data on sales with 5% or 15% discounts. In order to find sales 
records fulfilling this criterion, it is sufficient to OR bitmaps Bm5perc and 
Bml5perc in order to construct a result bitmap. Then, records pointed to by 
bits equal to T’ in the result bitmap are fetched from the Sales table. □ 

At the implementation level, bitmap indexes are organized either as B-trees 
m or as simple arrays in a binary file In the first case, B-tree leaves 
store bitmaps and a B-tree is a way to organize indexed values and bitmaps 
in files. 

Bitmap indexes allow to answer queries with the count function without 
accessing tables, since answers to such queries can be computed by sim- 
ply counting bits equaled to T’ a result bitmap. Moreover, such indexes of- 
fer a good query performance for attributes of narrow domains. For such 
attributes, a bitmap index will be much smaller than a traditional B-tree 
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Table 2.3. Example bitmap index created on the Discount attribute 



Sales 


bitmap index on Discount 


SalesPrice 


Discount 




Bm5perc 


BmlOperc 


Bml5perc 




5 




1 


0 


0 




15 




0 


0 


1 




5 




1 


0 


0 




10 




0 


1 


0 




10 




0 


1 


0 




15 




0 


0 


1 



index. Additionally, while evaluating queries with multiple predicates with 
equality and inequality operators, a system processes bitmaps very fast by 
AND-ing/OR-ing them. 

The size of a bitmap index strongly depends on the cardinality (domain 
width) of an indexed attribute, i.e. , the size of a bitmap index increases when 
the cardinality of an indexed attribute increases. Thus, for attributes of high 
cardinalities (wide domains) bitmap indexes become very large (much larger 
than a B-tree index). As a consequence, they cannot fit into main memory and 
the efficiency of accessing data with the support of such indexes deteriorates 

El- 

In order to reduce the size of bitmap indexes defined on attributes of high 
cardinalities, two following approaches have been proposed in the research 
literature, namely: (1) extensions to the structure of the basic bitmap index, 
and (2) bitmap index compression techniques. In the first approach, two main 
techniques, generally called binning as well as bit slicing, can be distinguished. 



Extensions to the Structure of the Basic Bitmap Index 

In ( 51 (called range-based bitmap indexing) and in (571 !5%1. 15 Qj (called bin- 
ning ), values of an indexed attribute are partitioned into ranges. A bitmap 
is constructed for representing a given range of values, rather than a distinct 
value. Bits in a single bitmap indicate whether the value of a given attribute 
of a row is within a specific range. This technique can also be applied when 
values of an indexed attribute are partitioned into sets. 

The technique proposed in m can be classified as a more general form of 
binning. In m sets of attribute values are represented together in a bitmap 
index. Such a technique reduces storage space for attributes of high cardinal- 
ities. The selection of attribute values represented in this kind of an index is 
based on query patterns and their frequencies, as well as on the distribution 
of attribute values. 

Another form of binning was proposed in m- This technique, called prop- 
erty maps, focuses on managing the total number of bins assigned to all in- 
dexed attributes. A property map defines properties on each attribute, such 
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as the set of queries using the attribute, distribution of values for the at- 
tribute, or encoded values of the attribute. The properties are represented as 
vectors of bits. A query processor needs extension in order to use property 
maps. Property maps support multi-attribute queries, inequality queries or 
high selectivity queries, and they are much smaller than bitmap indexes. 

The second technique is based on the so-called bit-sliced index jSDE2E5|- 
It is defined as an ordered list of bitmaps, B n , U ra_1 , . . . , B 1 , B°, that are used 
for representing values of a given attribute A, i.e. , B° represents the 2° bit, B 1 
represents the 2 1 bit, etc. Every value of an indexed attribute is represented 
on the same number of n bits. As a result, the encoded values in a table form 
n bitmaps. The bitmaps are called bit-slices. Data retrieval and computation 
are supported either by the bit-sliced index arithmetic m or by means of a 
dedicated retrieval function ESI- Additionally, a mapping data structure is re- 
quired for mapping the encoded values into their real values 1551 - 



Compression Techniques 

The second approach that allows to reduce the size of a bitmap index de- 
fined on an attribute of high cardinality is based on compression. Four main 
loss-less techniques can be distinguished in the research literature, namely: 
(1) Byte-aligned Bitmap Compression (BBC) jfi-'lj . (2) Word- Aligned Hybrid 
(WAH) 015511511551 , (3) Position List WAH (PLWAH) 0 ^, and (4) 
RL-Huffman (251 an d RLH JjH, E2 ■ 

All the aforementioned compression techniques apply the run-length en- 
coding. The basic idea of the run-length encoding consists in encoding con- 
tinuous vectors of bits having the same value (either “0” or “1”) into: (1) a 
common value of all bits in the vector (i.e., either “0” for a vector composed 
of zeros or “1” for a vector composed of ones) and (2) the length of the vector 
(i.e., the number of bits having the same value). Before encoding, a bitmap 
is divided into words. Next, words are grouped into the so-called runs. The 
run is composed of words that can be either a fill or a tail. The fill represents 
series of words that are composed of bits of the same value. The tail repre- 
sents the series of words that are composed of both “0” and “1” bits. Fills 
are compressed because of their homogeneous content, whereas tails are not. 

BBC divides bit vectors into 8-bit words, WAH and PLWAH divide them 
into 31-bit words, RLH uses words of a parameterized length, whereas RL- 
Huffman does not divide a bitmap into words. PLWAH is the modification 
of WAH. PLWAH improves compression if tail T that follows fill F differs 
from F on few bits only. In such a case, the fill word encodes the difference 
between T and F on some dedicated bits. Moreover, BBC uses four different 
types of runs, depending on the length of a fill and the structure of a tail, 
whereas the other compression techniques use only one type of a run. The 
overall idea of the WAH compression is illustrated with the example below 
taken from m- 
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Example 3. For the sake of simplicity let us assume that a 32-bit processor 
is used. A bitmap being compressed is composed of 5456 bits, as shown 
in Figure 12. 2h . The WAH compression of the bitmap is executed in three 
following steps. 

In the first step, the bitmap is divided into groups composed of 31 bits, 
as shown in Figure 12. 2h . In the example, 176 such groups are created. In 
the second step, adjacent groups containing identical bits are merged into 
one group, as shown in Figure POI- . Since group 1 is heterogeneous, i.e. , it 
is composed of “0” and “1” bits, it is not merged with a group following it. 
Groups 2 to 175 are homogeneous (composed of “0” bits) and they are merged 
into one large group, denoted in Figure 12.21 as group 2-175. This group 
includes 174*31 bits. The last group 176, similarly as group 1, is heterogeneous 
and it cannot be merged with a group preceding it. As the result of group 
merging, three final groups are created, as shown in Figure 



a) an example bitmap being compressed (5456 bits) 

100000 .. .. 0000001110000111,00000000000000000000 00000000000,00111111111111 1111011111 

I ll ll I 



31 bits 


5394 bits having value "0" 


31 bits 


b) dividing the bitmap into 31 -bits groups 






31 bits 


31 bits [ 


31 bits 


group 1 


group 2 


group 176 


c) merging adjacent homogeneous groups 




31 bits 


174*31 bits 


31 bits 



9 rou P 1 group 2-175 group 176 

d) group encoding by means of a 32-bits word 




run 1 run 2 



Fig. 2.2. The steps of the WAH compression 

In the third step, the three final groups are encoded on 32-bit words as 
follows (cf. Figure I2.2H 1. The first group represents the tail of the first run. 
The most significant bit (the leftmost one) has value “0” denoting a tail. Next 
31 bits are original bits of group 1. The second group ( group 2-175) represents 
the fill of the second run. The most significant bit (at position 31 ) is set to “1” 
denoting a fill. The bit at position 2 30 is set to “0” denoting that all bits in 
original group 2-175 have value “0” , i.e., the fill is used for compressing groups 
whose all bits have value “0”. The remaining 30 bits are used for encoding 
the number of homogeneous groups filled with “0.” In the example, there 
are 174 such groups. The number of homogeneous groups is represented by 
the binary value equaled to 000000000000000000000010101110, stored on the 
remaining 30 bits. The last 31-bits group, denoted as group 176 , represents 
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the tail of the second run. The most significant bit in this group has value 
“0” denoting a tail. The remaining 31 bits are original bits of group 176. □ 

The compression techniques proposed in m and EH additionally apply the 
Huffman compression ESI to the run-length encoded bitmaps. The main dif- 
ferences between m and E2 are as follows. First, in m only some bits in 
a bit vector are of interest, the others, called ’don’t cares’ can be replaced 
either by zeros or ones, depending on the values of neighbor bits. In RLH all 
bits are of interest and have their exact values. Second, in m the lengths 
of homogeneous subvectors of bits are counted and become the symbols that 
are encoded by the Huffman compression. RLH uses run-length encoding for 
representing distances between bits having value 1. Next, the distances are 
encoded by the Huffman compression. 

In practice, the commercial systems (Oracle DBMS) and prototype sys- 
tems (FastBit) |7I )1 1711 apply bitmap compression techniques. Oracle uses 
the BBC compression whereas FastBit uses the WAH compression. 



Bitmap Indexes in Oracle 

The Oracle DBMS [t2| supports explicitly created bitmap indexes. A bitmap 
index is created by means of the below command. 

create bitmap index BIName on table (column) ; 

As mentioned in Section the size of a bitmap index increases with 

the increase of the cardinality of the indexed attribute. In order to reduce 
the size of such a bitmap index, Oracle DBMS applies a compression based 
on BBC. When the sparsity of bitmaps exceeds a given threshold, Oracle 
automatically compresses bitmaps. Figure 3(a) shows how the size of a bitmap 
index depends on the cardinality of the indexed attribute. The cardinality was 
parameterized from 4 to 65564 unique values. The experiment was conducted 
on OraclelOg R2. The indexed table included 320 000 000 of rows. As a 
reference, the size of the Oracle B*-tree is also included in the chart. 

As we can observe from chart |3(a)[ with the increase of cardinality from 
4 to 16 the size of the bitmap index increases. Next, for cardinality equal to 
64, the size of the BI decreases that is caused by compressing the BI. Next, 
with the increase of the cardinality above 64 we observe further increase in 
the size of the compressed BI. 

Bitmap indexes support fast computations of function count. Figure |3(b)| 
shows the computation efficiency of count (1) for various selectivities of a 
query. The query selected from 0.39% to 50% of rows based on the values of 
the indexed attribute. As a reference, the figure includes also the performance 
of the Oracle B*-tree index. The characteristics of the indexes were tested on 
identical attributes. It must be noticed that when count is replaced by other 
aggregate functions, like for example sum, avg, the bitmap index performs 
much worse, but still better than B*-tree. It is because, in order to compute 
these aggregate functions data must be fetched from disk. 
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attribute cardinality 



(a) The size of a bitmap index and a B*-tree 
index 




































0.39 0.78 1.56 3.13 6.25 12.5 25 50 

% of selected rows 



(b) The computation efficiency of count (1) with 
the support of a bitmap index and a B*-tree index 



Fig. 2.3. Comparison of some features of a bitmap index and a B*-tree index in 
Oracle (# of rows in an indexed table: 320 000 000) 



Bitmap Indexes in DB2 

IN the DB2 DBMS, queries with predicates on multiple indexed attributes 
can be optimized by means of the index AND-ing technique. In general, this 
technique is based on determining the intersection of the sets of ROWIDs, 
which are retrieved with the support of individual indexes on attributes used 
in query predicates. The intersection is determined with the support of bloom 
filters |I2I E) . The bloom filters transform each set of ROWIDs into a sep- 
arate temporal bitmap. Multiple bitmaps can then be further processed by 
AND-ing or OR-ing them, depending on query predicates. The bitmaps are 
called dynamic bitmap indexes. 
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As an example illustrating the application of a dynamic bitmap index 
to the optimization of a query on one table, let us consider table Sales 
with two B-tree indexes, one defined on attribute ProductID and one on 
CustomerlD. An example query with predicate ProductID in (’ThEl’, 
’SoVl 1 , ’DeVl’) and CustomerlD = 100 could be answered by applying 
the index AND-ing technique with the support of a dynamic bitmap index as 
follows. First, by using index on ProductID a bitmap is created that describes 
rows fulfilling the predicate ProductID in (’ThEl 1 , ’SoVl’, ’DeVl’).Let 
us denote the bitmap as Bp ro d . Similarly, by using index on CustomerlD 
a bitmap, say Bcust is created that describes rows fulfilling the predicate 
CustomerlD = 100. Second, the final bitmap is computed by AND-ing Bp ro d 
and Bcust • Based on the final bitmap that is transformed back to ROWIDs, 
rows fulfilling the predicates are retrieved. 

As an example illustrating the application of dynamic bitmap indexes to 
optimizing star queries, let us consider star query Q 1 from Section 12.3.31 
In order to use the technique, a B-tree index has to be created on each 
of the foreign keys. In the first step, every dimension table is semi-joined 
with the Sales table in order to determine the set of Sales rows that join 
with rows form each of the dimensions. Let the result of Sales k Products, 
i.e. , the set of ROWIDs fulfilling the predicate on ProdName is denoted as 
Sp ro d ■ Let the result Sales k Customers, i.e., the set of ROWIDs fulfilling 
the predicate on Town is denoted as Scust- Finally, let Sales k Time, i.e., 
the set of ROWIDs fulfilling the predicate on Year is denoted as Spime- 
In the second step, Sp ro d, Scust , and Spime are transformed to three bitmaps, 
denoted as Bp ro d, Bcust , and Bpi me , respectively. In the third step, bitmaps 
B Prod, Bcust , Bpi me are AND-ed. The final bitmap describes rows fulfilling 
the whole query predicate. It is then transformed back to ROWIDs. Based 
on the ROWIDs records are fetched from the Sales table. If some columns 
from dimension tables are needed in the query result, then the final set of 
sales rows is joined with appropriate rows from the dimensions. 



Bitmap Indexes in SQL Server 

Similarly as DB2, SQL Server ca does not support explicitly created bitmap 
indexes. Instead, it supports bitmap filters (this mechanism is available from 
version 2005). However, bitmap filters are not bitmap indexes. The bitmap 
filter represents (in a compact format) the set of values from one table being 
joined, typically a dimension table. Based on the bitmap filter, rows from the 
second joined table, typically a fact table, are filtered. Thus, the bitmap filter 
is applied as a semi-join reduction technique but only in parallel execution 
plans. A bitmap filter is automatically created by the SQL Server query op- 
timizer when the optimizer estimates that such a filter is selective. A bitmap 
filter is a main memory data structure. 

In order to illustrate the application of bitmap filters to the optimization 
of a star query, let us consider star query Q 1 from Section 12.3.31 It could be 
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processed as follows. In the first step, bitmap filters are created. Bitmap fil- 
ter BFp ro d is created, based on values (’ThinkPad Edge’, ’Sony Vaio’, ’Dell 
Vostro’) from the Products dimension table. In parallel, bitmap filters on di- 
mension tables Customers and Time are created, based on the predicates 
Town=’ London’ and Year=2009, respectively. Let us denote the filters as 
BFcust and BFy ear , respectively. In the second step, BFp ro d is used for semi- 
join reduction with the Sales table (let denote the intermediate result as tem- 
porary table Sales r p^ ed ). In the third step, bitmap filter BFc us t is applied 
as semi-join reduction operator to temporary table Sales r p d °d ed ■ Let us de- 
note its result as Sales r p d jj^ ust . In the fourth step bitmap filter BFy ear is 
applied to Sales r p d °dlc U sv resulting in temporary table Sales r P ed ^ ust _ Year . 
Finally, rows fulfilling the whole predicate of the query are fetched by means 
of Sales r p d ^Q ust _ Y ear . The order in which the bitmap filters are applied de- 
pends on the selectivity of the filters. The most selective one should be applied 
first. 

2.3.3 Bitmap Join Index 

The advantages of the join index and the bitmap index have been combined 
in a bitmap join index jZSI CM E2j- This index, conceptually is organized 
as the join index but the entries to the bitmap join index are organized as 
a lookup by the value of a dimension attribute. Each entry to the bitmap 
join index is composed of the ROWID of a row from a dimension table (or 
an attribute uniquely identifying a row in a dimension table) and a bitmap 
(possibly compressed) describing rows from a fact table that join with this 
value. Similarly as for the join index, the access to the bitmap join index 
lookup column can be organized by means of a B-tree or a hash index. 

In order to illustrate the idea behind the bitmap join index let us consider 
the example below. 

Example 4- Let us return to Example Q and let us define the bitmap join 
index on attribute ProductID of table Products. Conceptually, the entries of 
this index are shown in Table EH The lookup of the index is organized by the 
values of attribute ProductID. Assuming that the leftmost bit in each of the 
bitmaps represents the first row in the Sales table, one can see that the first 
and third sales row join with product identified by value 100, for example. 



Table 2.4. Example bitmap join index organized as a lookup by attribute Prod- 
ucts. ProductID 



Products. Product ID 


bitmap 


100 


101000 


230 


010001 


300 


000110 
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Every indexed value (e.g. 100) in dimension table Products has associated 
a bitmap describing records from fact table Sales that join with the dimension 
record. □ 

The bitmap join index takes the advantage of the join index since it allows to 
materialize a join of tables. It also takes the advantage of the bitmap index 
with respect to efficient AND, OR, NOT operations on bitmaps. 

Bitmap Join Indexes in Oracle 

Oracle also supports the bitmap join index that offers a very good perfor- 
mance of star queries. As an example let us consider the below star query 
Q 1, that joins the Sales fact table with all of its dimension tables. 

/* query Q1 */ 

select sum(SalesPrice) 

from Sales, Products, Customers, Time 

where Sales . Product ID=Products .Product ID 

and Sales . CustomerID=Customers . CustomerlD 

and Sales . TimeKey=Time .TimeKey 

and ProdName in (’ThinkPad Edge’, ’Sony Vaio’, ’Dell Vostro’) 
and Town=’ London’ 
and Year=2009; 

In order to reduce the execution time of Q 1, three bitmap indexes can be 
created, as shown below. 

create bitmap index BI_Pr_Sales 
on Sales (Products . ProdName) 
from Sales, Products 

where Sales . Product ID=Products .Product ID ; 

create bitmap index BI_Cu_Sales 
on Sales (Customers .Town) 
from Sales , Customers 

where Sales . CustomerID=Customers . CustomerlD ; 

create bitmap index BI_Ti_Sales 
on Sales (Time . Year) 
from Sales, Time 

where Sales . T imeKey=T ime . T imeKey ; 

The query execution plan, shown below, reveals how the bitmap join in- 
dexes are used. First, bitmaps for ’ThinkPad Edge’, ’Sony Vaio’, and ’Dell 
Vostro’ are retrieved and OR-ed (cf. lines 10-12), creating a temporary 
bitmap. Next, the bitmap for ’London’ is retrieved (cf. line 8) and it is AND- 
ed with the temporary bitmap (cf. line 7). The result bitmap is then converted 
to ROWIDs of records fulfilling the criteria (cf. line 6). In this query plan 
table Time was accessed with the support of a B-tree index defined on its 
primary key, rather than by means of the bitmap join index. 



42 



R. Wrembel 



lid 


1 Operation 


1 Name 


iRows I Bytes 


1 Cost (°/,CPU) I Time | 


1 0 | 


SELECT STATEMENT 


i 


i 


ii 


58 


1 13 


(8) 100:00:011 


1 1 1 


SORT AGGREGATE 


i 


i 


1 1 


58 


i 


i i 


1 2 I 


NESTED LOOPS 


i 


i 


211 


1218 


1 13 


(8) 100:00:011 


1 3 I 


HASH JOIN 


i 


i 


221 


1012 


1 12 


(9) 100:00:011 


1 4 I 


TABLE ACCESS FULL 


1 PRODUCTS | 


31 


51 


1 3 


(0) 100:00:011 


1 5 I 


TABLE ACCESS BY INDEX ROWID 


1 SALES 


i 


11551 


33495 


1 8 


(0) 100:00:011 


1 6 I 


BITMAP CONVERSION TO ROWIDS 


i 


i 


i 




i 


i i 


1 7 I 


BITMAP AND 


i 


i 


i 




i 


i i 


1 8 I 


BITMAP INDEX SINGLE VALUE 


1 BI_CU_ 


SALES I 


i 




i 


i i 


1 9 I 


BITMAP OR 


i 


i 


i 




i 


i i 


110 I 


BITMAP INDEX SINGLE VALUE|BI_PR_ 


SALES I 


i 




i 


i i 


111 1 


BITMAP INDEX SINGLE VALUE|BI_PR_ 


SALES I 


i 




i 


i i 


112 I 


BITMAP INDEX SINGLE VALUE|BI_PR_ 


SALES I 


i 




i 


i i 


113 I 


TABLE ACCESS BY INDEX ROWID 


ITIME 


i 


ii 


12 


1 1 


(0) 100:00:011 


114 I 


INDEX UNIQUE SCAN 


IPK.TIME | 


ii 




1 0 


(0) 100:00:011 



Alternatively, it is possible to create one bitmap join index on all of the di- 
mension attributes. The example below command creates the bitmap join in- 
dex on three attributes of the dimension tables, namely: Products. prodName, 
Customers. Town , and Time. Year. 

create bitmap index BI_Pr_Cu_Ti_Sales 

on Sales (Products . ProdName , Customers .Town, Time. Year) 

from Sales, Products, Customers, Time 

where Sales . Product ID=Products .Product ID 

and Sales . CustomerID=Customers . CustomerlD 

and Sales . TimeKey=Time .TimeKey; 

With the support of this index, the aforementioned star query Q1 is exe- 
cuted as shown below. As we can observe, its execution plan is much simpler 
than before. The bitmap join index is accessed only once (cf. line 5). 
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Unfortunately, this bitmap join index cannot be used for answering queries 
with predicates on one or two dimensions, like for example 

ProdName in (’ThinkPad Edge’, ’Sony Vaio’, ’Dell Vostro’) and Town= 
’London’ 

or 

Town= ’London’ and Year=2009 

or 



Year=2009 
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On the contrary, the three independent bitmap join indexes defined ear- 
lier, i.e. , BI_Pr_Sales , BPCuSales , BPTi-Sales offer much flexible indexing 
scheme as they can be used answering queries with predicates on some of the 
dimensions. 

2.4 Materialized Views and Query Rewriting 

Execution time of complex, time consuming star queries can be reduced by 
physically storing and re-using their previously computed results. This way, 
a precomputed result is perceived by a query optimizer as another source of 
data that can be queried. The precomputed result is commonly called a ma- 
terialized view (MV). If a user executes query Q that computes values that 
have already been stored in materialized view MVi , then a query optimizer 
will rewrite original query Q into another query Q' so that Q' is executed on 
MVi and Q' returns the same result as original query Q. This technique of 
automatic, in-background rewriting of users’ queries is called a query rewrit- 
ing. Notice that a user’s query must not necessarily be exactly the same as 
the query whose result was stored in a materialized view. In this case, a query 
optimizer may join a materialized view with other MVs or tables, may further 
aggregate and filter the content of a MV and may project columns of a MV, 
in order to answer a user’s query U In order to illustrate the idea behind 
the query rewriting mechanism let us consider the example below. 

Example 5. Let us return to the DW logical schema shown in Figure ETT1 a, nd 
let us assume that in this DW users execute the below queries. The first 
one computes quarterly sales of product categories in countries. The second 
one computes monthly and quarterly sales of products belonging to category 
’electronic’. 

select Category, Country, Quarter, 
sum(SalesPrice) as SumSales 
from Sales s, Products p, Customers c, Time t 
where s . Product ID=p . Product ID 
and s . CustomerID=c . Customer ID 
and s.TimeKey=t.TimeKey 
group by Category, Country, Quarter; 

select ProdName, Country, Month, 

sum(SalesPrice) as SumSales 
from Sales s, Products p, Customers c, Time t 
where s . Product ID=p . Product ID 
and s . CustomerID=c . Customer ID 
and s.TimeKey=t.TimeKey 
and p . Category=’ electronic ’ 
group by ProdName , Country , Month 

Both queries could be optimized by materialized view SalesMVl , shown 
below, whose query makes available monthly and quarterly sales values of 
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products and their categories for each country. Both of the above queries 
could be answered by means of the materialized view by further aggregating 
its content. 

create materialized view SalesMVl 

select ProdName, Category, Country, Month, Quarter, 
sum(SalesPrice) as SumSales 
from Sales s, Products p, Customers c, Time t 
where s . Product ID=p . Product ID 
and s . CustomerID=c . Customer ID 
and s . T imeKey=t . T imeKey 

group by ProdName, Category, Country, Month, Quarter □ 



As we can see from the example, the definition of a materialized view includes 
a query. Tables referenced in the query are called base tables. 

Although in the above example one materialized view is used for rewriting 
two queries, in practice, multiple materialized views need to be created in 
order to optimize a certain workload of queries. For this reason, a challeng- 
ing research issue concerns the selection of such a set of materialized views 
that: (1) will be used for optimizing the greatest possible number of the most 
expensive queries and (2) whose maintenance will not be costly. Several re- 
search works have addressed this problem and they have proposed multiple 
algorithms for selecting optimal sets of materialized views for a given query 
workload, e.g. CBI TDj SDl 81 . Commercial DB/DWMSs, like, Oracle, DB2, 
and SQL Server provide tools that analyze workloads and based on them, 
propose sets of database objects, typically materialized views and indexes, 
for optimizing the workloads. 

2.4.1 Materialized Views in Oracle 

In Oracle, a materialized view is created with the create materialized 
view command. Its definition includes: (1) the moment when the view is filled 
in with data, (2) its refreshing method (fast, complete, force), (3) whether the 
materialized view is refreshed automatically or on demand, (4) row identifi- 
cation method (required for incremental refreshing), (5) the view automatic 
refreshing interval, and (6) a query computing the content of the materialized 
view. 

A command creating an example materialized view YearlySalesMV is 
shown below. 

create materialized view YearlySalesMVl 

build immediate 

refresh force 

with rowid 

as 

select ProdName, Category, Quarter, Year, 



2 Data Warehouse Performance 



45 



sum(SalesPrice) as SumSales 
from Sales s, Products p, Time t 
where s . Product ID=p . Product ID 
and s.TimeKey=t.TimeKey 

group by ProdName, Category, Quarter, Year; 

The build immediate clause denotes that the view will be filled in with 
data during the execution of the command. The refresh force clause de- 
notes that the system automatically selects the refreshing mode (either fast, 
i.e. , incremental or complete). If the system is able to refresh the MV fast, 
then this method is used. Otherwise the complete refreshing is used. The with 
ROWID clause defines the method of identifying rows in the MV and its base 
tables. In this example, rows will be identified based on their physical ad- 
dresses (ROWIDs). Rows can also be identified based on their primary keys. 
To this end, the with primary key clause is applied. Row identification is 
required for incremental refreshing. 

For some versions of Oracle (e.g., lOg) clause with rowid clashes with 
clause enable query rewrite. The latter makes available a MV for query 
rewriting. Normally, this clause is part of the MV definition, placed between 
with rowid I primary key and as. In cases these clauses clash, one has to 
make the view available for query rewriting by executing additional command 
shown below. 

alter materialized view MVName enable query rewrite; 

Oracle materialized views will be used in query rewriting provided that: 
(1) the cost-based query optimizer is used, (2) materialized views have been 
enabled for query rewriting, (3) the system works in the query rewriting 
mode, and (4) a user executing a query has appropriate privileges. 

In order to illustrate the query rewriting process, let us consider the below 
query. We assume that MV Yearly SalesMVl was created and made avail- 
able for query rewriting. The query execution plan reveals that it was au- 
tomatically rewritten on the YearlySalesMV , cf. row number 2 in the below 
plan. This row denotes that the MV was sequentially scanned and its content 
grouped (cf. row number 1) in order to compute the more coarse aggregate. 

select ProdName, Year, sum(SalesPrice) as SumSales 

from Sales s, Products p, Time t 

where s . ProductID=p . ProductID 

and s.TimeKey=t.TimeKey 

and t.Year=2009 

group by ProdName, Year; 
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As mentioned before, a MV can be refreshed fast, provided two conditions 
are fulfilled. First, the MV query definition includes only the allowed con- 
structs. The set of these constructs is extended from version to version of the 
Oracle DBMS (cf. | 72 |). Second, each of the MV base tables has associated 
its own materialized view log. The log records DML operations applied to the 
base table. The content of the log is used for fast refreshing the MV. 

The materialized view log is created by the below command (only its basic 
form is shown). For a MV created with the with primary key clause the MV 
log must include the same clause. The same applies to the with rowid clause. 
In the first case the log will store the values of columns that constitute the 
primary key of the table, whereas in the second case it will store the values 
of ROWID. A MV log may include both clauses if it serves for both types of 
MV. Apart from the primary key and ROWID column, a materialized view 
log can store values of other columns whose changing values have impact on 
the content of the MV. The list of these columns is denoted as coll , . . . , 
coin. 

create materialized view log on BaseTable 
with primary key I ROWID (coll, .... coin), 

sequence 

including new values ; 

Including the sequence clause in a materialized view log definition results 
in the creation of a column SEQUENCE$$ in the log. Its values represent the 
order in which DML operations are executed on the base table for which the 
log was created. Oracle advises to create this column for fast refreshing after 
mixed insert, update, delete operations on the base table. The including 
new values clause forces Oracle to store in the MV log old values of columns 
listed in coll, . . . , coin as well as their new values. Omitting this clause 
results in storing only the old values of columns. This clause must be included 
in the MV log definition in order to support fast refreshing MVs computing 
aggregates. 

2.4.2 Materialized Views in DB2 

In DB2 a materialized view, called a materialized query table (MQT) 
or a summary table, is created with the create table command with 
additional clauses defining its maintenance. The clauses include: (1) 
maintained by {system I user}, (2) refresh {immediate I deferred}, 
(3) data initially {immediate I deferred}, and (4) enable query 
optimization. 

A MQT can be maintained either by a system or by a user. In the first 
case, the maintained by system clause is used. This is a default clause. For 
a system-maintained MQT, one can define either automatic or non-automatic 
refreshing mode. In the automatic mode, a MQT is refreshed automatically 
as the result of changes in the content of its base tables. To this end, the 
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definition of a MQT must include the refresh immediate clause. This re- 
freshing mode requires that a unique key from each base table is included 
in the select command defining the MQT. refresh deferred denotes that 
a MQT has to be explicitly refreshed by explicit execution of the refresh 
table command, with the syntax shown below. 

refresh table TableName {incremental I not incremental} 

The data initially immediate clause causes that the content of a MQT 
is computed as part of the command creating the MQT. As a result, as soon 
as the command is finished, the MQT is filled in with its materialized data. 
The data initially deferred clause denotes that the content of the MQT 
is not computed as part of the command creating the MQT. After being 
created, the MQT is in the check pending state. In this state, the MQT 
cannot be queried until the set integrity command has been executed on 
the MQT. 

The enable query optimization clause, similarly as in Oracle, makes a 
MQT available for query rewriting. It is the default clause. 

An example system-maintained MQT, called Yearly S ales MV2, is shown 
below. The MQT is filled in with data during its creation process (data 
initially immediate), is refreshed immediately after changes have been 
applied to its base tables (refresh immediate), and is made available for 
query optimization (enable query optimization). 

create table YearlySalesMV2 
as 

(select ProdID, ProdName, Year, sum(salesPrice) as SumSales 
from Sales s, Products p. Time t 
where s . Pr oduct ID=p . Product ID 
and s .TimeKey=t .TimeKey 
and t.Year=2009 

group by ProdID, ProdName, Year) 
data initially immediate 
refresh immediate 
maintained by system 
enable query optimization; 

In order to create a user-defined MQT, the maintained by user clause 
has to be included in the definition of a MQT. This type of a MQT can be 
refreshed only in the deferred mode and the refresh table command cannot 
be applied to such a MQT. Thus, a user is responsible for implementing the 
procedure of refreshing the MQT (by means of inserts, imports, triggers, etc.). 
An example definition of a user-maintained MQT is shown below. After being 
created, the MQT is in an inconsistent state and must be made consistent 
by executing the set integrity command. 

create table YearlySalesMV3 
as 

(select Year, sum(salesPrice) as SumSales 
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from Sales s, Time t 
where and s . TimeKey=t . TimeKey 
group by Year) 
data initially deferred 
refresh deferred 
maintained by user 

set integrity for YearlySalesMV3 materialized query 
immediate unchecked 

A MQT created with the refreshed deferred clause can be incremen- 
tally refreshed provided that a system maintains a log of changes that are 
used for refreshing the MQT. This log is called a staging table. A staging ta- 
ble is created by the create table command. After being created, a staging 
table is in an inconsistent state and must be made consistent by executing 
the set integrity command. The simple example below illustrates how to 
create a staging table for MQT Yearly SalesMV 3. 

create table YearlySalesMV3_ST for YearlySalesMV3 propagate immediate 
set integrity for YearlySalesMV3 staging immediate unchecked 



2.4.3 Materialized Views in SQL Server 

In SQL Server, a materialized view, called indexed, view , is created by creating 
a unique clustered index on a view (cf. |E3). The index causes that the 
view is materialized, i.e., the whole result set of a query defining a view 
is persistently stored in a database. The definition of an indexed view has 
to fulfill the following conditions. First, an indexed view requires a column 
whose value is unique. A unique index is then created on this column that 
results in clustering the data by this column. Second, an indexed view has to 
be created with the schemabinding clause that prevents from modifying the 
base tables of a view as long as the view exists. Third, all the view base tables 
must be referenced by schemaname . tablename. Fourth, the query defining 
a materialized view may not contain the following SQL constructs, among 
others: exists, not exists, count (*), min, max, top, union, outer join, 
non deterministic functions (e.g., GetDate), and subqueries. 

The below example illustrates commands creating indexed view 
YearlySalesMV2. The create view command defines the view with a query. 
The second command creates a unique cluster index on columns ProdID, 
ProdName , and Year. As a consequence, the whole result set of the query is 
materialized. 

create view YearlySalesMV2 
with schemabinding 
as 

select ProdID, ProdName, Year, sum(salesPrice) as SumSales 
from Sales s, Products p, Time t 
where s . Product ID=p . Product ID 
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and s.TimeKey=t.TimeKey 
and t.Year=2009 

group by ProdID, ProdName, Year 

create unique clustered index Indx_ProdID 
on YearlySalesMV(ProdID, ProdName, Year) 

Similarly as in Oracle and DB2, in the Developer and Enterprise editions 
of SQL Server, an indexed view can be used by a query optimizer for query 
rewriting. In other editions of SQL Server, in order to force a query optimizer 
to rewrite a query based on an indexed view, the query must explicitly ref- 
erence the indexed view and must include hint noexpand. In order to force 
a query optimizer to use base tables, rather than an indexed view, a query 
must include option expand views. Both of them are used in the below two 
query templates. 

select Columnl, Column2, ... 

from Table, IndexedView with (noexpand) 

where . . . 

select Columnl, Column2, ... 
from Table, IndexedView 
where . . . 

option (expand views) 

Indexed views, similarly as traditional indexes are automatically refreshed 
by a system. The refreshing mode is immediate and incremental, as for tra- 
ditional indexes. Notice that, non-clustered indexes can also be created on a 
materialized view in order to support quicker access to the view content. 

2.5 Partitioning 

Partitioning is a mechanism of dividing a table or index into smaller parts, 
called partitions. Due to space limitations, in this chapter we will focus only 
on table partitioning. 

The most benefit from partitioning is achieved if every partition is stored 
on a separate disc. This way, multiple partitions can be accessed in parallel 
without disc contest. 

There are two types of partitioning, namely horizontal and vertical. When 
a table, say T, is partitioned horizontally, its content is divided into dis- 
junctive subsets of rows. Every partition includes identical schema, that is 
the schema of T. When table T is partitioned vertically , it is divided into 
disjunctive subsets of columns (except a primary key that exists in every 
partition) and includes all rows from the initial table. 

Partitioning is guided by three following rules, namely completeness, dis- 
jointness, and reconstruction. Completeness states that when table T was 
partitioned into Pi, P 2 , ..., P n then every row from T or its fragment must 
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be stored in one of these partition. This criterion guarantees that after par- 
titioning no data will disappear. Disjointness states that when table T was 
partitioned into Pi, P 2 , ..., P n then every row or its fragment from T must 
be stored in exactly one partition. An exception to this rule is vertical parti- 
tioning where ever vertical partition stores a primary key of T. This criterion 
guarantees that partitioning does not introduce data redundancy. Reconstruc- 
tion states that there mus be a mechanism of reconstructing original table 
T from its partitions. In horizontal partitioning the reconstruction of T is 
done by unions of the partitions, whereas in vertical partitioning it is done 
by joining the partitions. 

In horizontal partitioning rows from an original table are placed in par- 
titions based on partitioning criteria. In this type of partitioning rows are 
divided into subsets based on the value of a selected attribute (or attributes) , 
called a partitioning attribute. With this respect, there are several techniques 
of partitioning rows based on a partitioning attribute. The first one is hash- 
based. In this technique, rows are placed in partitions based on a hash function 
that for each row takes as its argument the value of a partitioning attribute 
and returns the number of a partition where the row is to be stored. The 
second one is range-based. In this technique, every partition has defined the 
range of values of a partitioning attribute it can store. The third technique 
is value-based. In this technique, every partition has defined the set of values 
it can store. The fourth technique is based on the round robin algorithm. 

Inserting, updating, and deleting data from partitioned tables are man- 
aged by a system. If a row is inserted, it is a system that select the right 
partition where the row will be stored. If a row is deleted, then the system 
finds a right partition where the row was stored. If the value of a partition- 
ing attribute of an existing row is updated, then depending on a system and 
system parameters, a system may move an updated row from one to another 
partition. 

In this chapter we focus on horizontal partitioning, which is natively sup- 
ported by Oracle, DB2, and SQL Server. Vertical partitioning must be sim- 
ulated in these three DB/DWMSs. Typically, vertical partitioning of table 
T is simulated by creating n separate tables Tj, each of which contains the 
subset of columns of T. Additionally, each of Tj must contain a primary key 
column(s) used for the reconstruction of the original table T. The reconstruc- 
tion can be implemented by a view on top of tables 7j. 

Vertical partitioning is supported in commercial and open source systems 
that use column storage, e.g., Sybase IQ, EMC Greenplum, C-Store/Vertica, 
MonetDB, Sadas, FastBit, Model 204. 

2.5.1 Partitioning in Oracle 

Oracle supports multiple partitioning techniques, namely: range, interval, list, 
hash, virtual column, system, reference, and composite. In this section, we 
will briefly overview the partitioning techniques. 
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Range Partitioning 

In the range partitioning, each partition has defined its own ranges of val- 
ues it can store. The ranges are applicable to a partitioning attribute. For 
example, the below table Sales-Range-TKey includes five partitions named 
Sales-lQ-2009 , ..., Sales-Others. Partition SalesAQ-2009 accepts rows whose 
values of partitioning attribute TimeKey are lower than ’01-04-2009’. Parti- 
tion Sales-2Q-2009 accepts rows whose values of the partitioning attribute 
fulfill the condition ’01-04-2009’ < TimeKey < ’01-07-2009’, etc. Notice, that 
in range partitioning partitions must be ordered by the ranges. Partition 
Sales-Others is defined with the MAXVALUE keyword. It allows to store all the 
other records having the value of TimeKey greater or equal to ’01-01-2010’. 
Clause tablespace allows to point a tablespace where a partition is to be 
physically stored (an Oracle tablespace is a database object that allows to 
logically organize multiple files under one name). 

create table Sales_Range_TKey 

(ProductID varchar2(8) not null references Products (ProductID) , 

TimeKey date not null references time (TimeKey) , 

CustomerlD varchar2(10) not null references Customers (CustomerlD) , 
SalesPrice number (6, 2)) 

PARTITION by RANGE (TimeKey) 

(partition Sales_lQ_2009 

values less than (T0_DATE( ’01-04-2009’ , ’ DD-MM-YYYY ’ ) ) 
tablespace DataOl, 
partition Sales_2Q_2009 

values less than (T0_DATE( ’01-07-2009’ , ’DD-MM-YYYY’)) 
tablespace Data02, 
partition Sales_3Q_2009 

values less than (T0_DATE( ’01-10-2009’ , ’DD-MM-YYYY’)) 
tablespace Data03, 
partition Sales_4Q_2009 

values less than (T0_DATE( ’01-01-2010’ , ’DD-MM-YYYY’)) 
tablespace Data04, 
partition Sales_0thers 

values less than (MAXVALUE) tablespace Data05) ; 

Interval Partitioning 

A special type of a range partition is an interval partition whose partitioning 
attribute is typically of type date. The difference between these two partitions 
is that interval partitions are automatically created by a system when needed. 
If data not fitting into existing partitions are to be inserted, then appropriate 
new partitions are created to store the data. Interval partitions are created 
by using the interval keyword followed by the definition of the interval. 
The interval is defined by means of a system function NumToYMInterval 
whose fist argument is the value (length) of an interval and the second one 
is its measurement unit. A fragment of an SQL command defining interval 
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partitioning is shown below. In this example, an interval is equal to three 
months. 



PARTITION by RANGE (TimeKey) 

INTERVAL (NumToYMInterval (3 , ’MONTH’)) 

(partition Sales_lQ_2009 

values less than (T0_DATE( ’ 01-04-2009 ’ , ’ DD-MM-YYYY ’ ) ) , 
partition Sales_2Q_2009 

values less than (T0_DATE( ’ 01-07-2009 ’ , ’DD-MM-YYYY’))); 



List Partitioning 

In the list partitioning each partition has assigned the set of values 
of a partitioning attribute that the partition accepts. The example be- 
low table Sales-List-PayType is divided into three partitions. Partition 
Sales-Credit_Debit stores sales record paid with a credit card (’Cr’) or a 
debit card (’De’). Partition Sales-Cash stores sales records paid with cash. 
The last partition stores records having the value of PaymentType other than 
the three aforementioned. To this end, the DEFAULT keyword is used. 

create table Sales_List_PayType 

(ProductID varchar2(8) not null references Products (ProductID) , 

TimeKey date not null references time (TimeKey) , 

CustomerlD varchar2(10) not null references Customers (CustomerlD) , 
SalesPrice number (6, 2), 

PaymentType varchar(2)) 

PARTITION by LIST (PaymentType) 

(partition Sales_Credit_Debit values (’Cr’.’De’) tablespace DataOl, 
partition Sales_Cash values (’Ca’) tablespace Data02, 
partition Sales_0thers values (DEFAULT) tablespace Data05 

); 

Hash Partitioning 

In the hash partitioning, data are placed in partitions by an internal Oracle 
hash function. As an example, let us consider table Sales-Hash-CustID that 
is composed of two partitions. Both of them have names assigned by a system 
and are stored in a default tablespace. 

create table Sales_Hash_CustID 

(ProductID varchar2(8) not null references Products (ProductID) , 

TimeKey date not null references time (TimeKey) , 

CustomerlD varchar2(10) not null references Customers (CustomerlD) , 
SalesPrice number (6, 2), 

PaymentType char(l)) 

PARTITION by HASH (CustomerlD) partitions 2; 
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A user can explicitly assign names and storage locations for hash partitions. 
To this end, the PARTITION by clause has to be modified, as shown below. 

PARTITION by HASH (CustomerlD) 

(partition Custl tablespace DataOl, 
partition Cust2 tablespace Data02)); 



Virtual Column Partitioning 

Virtual column partitioning (available from Oraclellg) requires a virtual col- 
umn in the definition of a table. A virtual column is a column whose value 
is computed either by a formula or a stored deterministic function (key- 
word deterministic in a function definition). Next, this column is used as 
a partitioning attribute, but when a virtual column is used as a partitioning 
attribute its values cannot be returned by a stored function. Typically, this 
type of partitioning is applicable either to range or list partitioning. 

As an example, we show below a fragment of the command creating par- 
titions based on virtual column Gross. 

create table Products_Virtl 

(. . . 

SellPrice number (6, 2) , 

Tax number(4,2), 

Gross as (SellPrice*Tax) ) 

PARTITION by range (Gross) 

(partition Prodl values less than (1000), 
partition Prod2 values less than (2000)); 

Queries that use in their predicates either virtual column Gross or formula 
SellPrice*Tax can profit from the above partitioned table. For example, the 
simple query below 

select * from Products_Virtl 
where SellPrice*Tax<1000 

will be answered by accessing only partition Prodl. 



System Partitioning 

In a system partitioning (available from Oraclellg), partitions do not have 
assigned any constraints and any row can be inserted into any partition. In 
this case, the DBMS does not control the placement of rows, i.e. , it is a user 
(or application logic) that is responsible for inserting rows into the required 
partitions. In the system partitioning, every insert command must explicitly 
include the name of a partition where a row is to be inserted. A fragment of 
a command defining a system partitioned table is shown below. 

create table Customers_Sys 

(CustomerlD varchar2(10) CONSTRAINT pk_Customers PRIMARY KEY, 
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. . .) 

PARTITION by SYSTEM 

(partition Cust_Europe, partition Cust_America, partition Cust_Asia) ; 



Reference Partitioning 

A reference partitioning (available from Oraclellg) is applicable to partition- 
ing tables related to each other by primary key - foreign key relationships. A 
table with a primary key has defined explicitly a partition schema whereas 
a table with a foreign key inherits partitioning attribute and schema from 
the parent table. As an example illustrating this type of partitioning let us 
consider range partitioned table Products-List-Cat , as shown below. 

create table Products_List_Cat 
(ProductID varchar2(8) PRIMARY KEY, 

ProdName varchar2(30) , 

Category varchar2(15) , 

SellPrice number (6,2), 

Manufacturer varchar2(20) ) 

PARTITION by LIST(Category) 

(partition Prod_Elect values (’electronic’), 
partition Prod_Clo values (’clothes’)); 

Table Sales-List-Cat will inherit partitioning attribute and partition defi- 
nitions from Products-List-Cat. 

create table Sales_List_PayType 
(ProductID varchar2(8) not null 

constraint ProdID_FK references Products_List_Cat (ProductID) , 

. . .) 

PARTITION by REFERENCE (ProdID.FK) ; 

When tables are partitioned by reference, a query optimizer joins the tables 
by the partition wise join. In this join, only those partitions are joined that 
produce non empty set. 



Composite Partitioning 

Composite partitioning allows to divide main partitions into subpartitions. In 
Oraclellg main partitions can be either range or list, whereas subpartitions 
can be range, list, or hash. 

create table Sales_Comp_RH 

(ProductID varchar2(8) not null references Products (ProductID) , 

TimeKey date not null references time (TimeKey) , 

CustomerlD varchar2(10) not null references Customers (CustomerlD) , 
SalesPrice number (6, 2)) 

PARTITION by RANGE (TimeKey) 

SUBPARTITION by HASH (ProductID) subpartitions 2 
(partition Sales_lQ_2009 
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values less than (T0_DATE( ’ 01-04-2009’ , 
tablespace Users, 
partition Sales_2Q_2009 

values less than (T0_DATE( ’ 01-07-2009 ’ , 
tablespace Users, 
partition Sales_3Q_2009 

values less than (T0_DATE( ’ 01-10-2009’ , 
tablespace Users, 
partition Sales_4Q_2009 

values less than (T0_DATE( ’ 01-01-2010 ’ , 
tablespace Users) ; 



’ DD-MM-YYYY ’ ) ) 
’ DD-MM-YYYY ’ ) ) 
’DD-MM-YYYY’)) 
’DD-MM-YYYY’)) 



DDL and Select on Partitioned Tables 

When the value of a partitioning attribute of a given row is updated and 
this row no longer qualifies for its partition then the row can be automati- 
cally migrated into another partition. Automatic migration is available when 
for a partitioned table the alter table TableName enable row movement 
command is executed. Otherwise, such an update will not be possible. 

Partitions can be explicitly addressed by queries, delete commands, and in- 
sert commands (the latter is possible only for system partitions). To this end, 
the table name must be followed by keyword partition(PartitionName) . 
If however, a query does not address a partition explicitly, a query optimizer, 
based on partition definitions and query predicates, will optimize the query 
and will address only these partitions that contribute to the query result. 

2.5.2 Partitioning in DB2 

DB2 supports range partitioning of tables and indexes. The mechanism is 
similar to the one described for Oracle. An example fragment of a command 
creating a range-partitioned table is shown below. For every partition, the 
range of values of a partitioning attribute (i.e. , TimeKey ) is defined, similarly 
as in Oracle. The below table is composed of four partitions, each of which 
stores sales from one quarter of year 2009. 

create table Sales_Range_TKey 
(iProductID varchar2(8) , ...) 

PARTITION BY RANGE (TimeKey) 

(partition Sales_lQ_2009 starting ’01-01-2009’, 
partition Sales_2Q_2009 starting ’01-04-2009’, 
partition Sales_3Q_2009 starting ’01-07-2009’, 

partition Sales_4Q_2009 starting ’01-10-2009’ ending ’31-12-2009’) 

2.5.3 Partitioning in SQL Server 

SQL Server (from version 2005) provides horizontal range partitioning, sim- 
ilar to Oracle and DB2. Partitioning applies to tables in indexes. 
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Partitioning is defined with the support of two database object, namely a 
partition function and a partition scheme. The partition function defines the 
number of partitions for a table and ranges of values for every partition. The 
partition scheme defines storage locations for table partitions. The definition 
of a partition scheme is based on the partition function. 

An example partition function Sales-Range-TKey , defined for attribute of 
type datetime, is shown below. Applying this function to partitioning a table 
will result in five partitions having the following ranges of dates: date<’2009- 
04-01’, ’2009-04-01’< date <’2009-07-01’, ’2009-07-01’< date <’2009-10-01’, 
’2009-10-01’ < date <’2010-01-01’, and date>’2010-01-01’. The range right 
or range left keywords define the policy of inclusion of border values. If 
the range left keyword is specified then the ranges for partitions would be 
defined as follows: date<’2009-04-01’, ’2009-04-01’< date < ’2009-07-01’, ..., 
and date>’2010-01-01’. 

create PARTITION FUNCTION [Sales_Range_TKey] (datetime) 
as RANGE right for values 

(’20090401’, ’20090701’, ’20091001’, ’20100101’); 

A partition function may be used also for numeric and character columns 
in the same way as illustrated above. 

An example partition scheme PS-Sales-Range-TKey based on the 
Sales-Range-TKey partition function is shown below. Each of the five par- 
titions created by the partition function is placed in a separate filegroup, 
whose name is given in the to clause. Partition storing range date<’2009-04- 
01’ is stored in file group DataOl, partition storing range ’2009-04-01’< date 
<’2009-07-01’ is stored in Data02 , etc. A SQL Server filegroup is a database 
object offering the mechanism similar to a tablespace in Oracle, i.e. , it allows 
to combine multiple files under a given name. 

create PARTITION SCHEME PS_Sales_Range_TKey 

as partition Sales_Range_TKey 

to (DataOl, Data02, Data03, Data04, Data05) ; 

Finally, based on the defined partitioning scheme, a partitioned table can 
be created, as shown below. 

create table Sales_Range_TKey 
(ProductID varchar(8) , 

TimeKey datetime ...) 
on PS_Sales_Range_TKey (TimeKey) 



2.6 Summary 

Multiple research and technological works in the area of data warehouses have 
been focusing on providing means for increasing the performance of a data 
warehouse for analytical queries and other types of data processing. As men- 
tioned already, DW performance depends on multiple components of a DW 
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architecture that include among others: hardware and computational archi- 
tectures, physical storage schemes of data, query optimization and execution, 
dedicated data structures supporting faster data access. 

In this chapter we focused on just one component from the aforementioned 
list, i.e. , on data structures. The most popular data structures used in practice 
in major commercial DWMSs include: various index structures, materialized 
views, as well as partitioning of tables and indexes. In this chapter, first we 
discussed basic index structures, including a bitmap index, a join index, and a 
bitmap join index. We outlined the functionality of explicitly created bitmap 
indexes and bitmap join indexes in Oracle and we showed how they are 
applied in a query execution process. We also showed how system-managed 
bitmap indexes are applied to star query executions in DB2 and SQL Server. 
Second, we discussed the concept of materialized views and their application 
to query rewriting. We analyzed the functionality of materialized views in 
Oracle, DB2, and SQL Server. We showed how star queries are rewritten 
based on Oracle materialized views. Third, we discussed table partitioning 
techniques in Oracle, DB2, and SQL Server. 

From the research and technological point of view, open issues in the area 
of DW performance include among others: building DWs in a parallel compu- 
tation environments (cloud, grid, clusters, GPUs), main memory DWs, effi- 
cient data storage schemes (the column store and storage in MOLAP servers), 
query optimization and processing (especially in parallel computation envi- 
ronments and in main memory architectures), novel data structures support- 
ing faster access to data, data and index compression techniques, testing and 
assessing a DW performance. 

New business domains of DW application require more advanced DW func- 
tionalities, often combining the transactional and analytical features m- For 
example, monitoring unauthorized credit card usage, monitoring telecommu- 
nication networks and predicting their failures, monitoring car traffic, ana- 
lyzing and predicting share rates, require accurate and up to date analytical 
reports. In order to fulfill this demand, one has to assure that the content 
of a DW is synchronized with the content of data sources with a minimum 
delay, e.g., seconds or minutes, rather than hours. Moreover, queries have to 
be answered instantly analyzing new data that were just loaded into a DW. 
To this end, the technology of a real-time (near real-time, right-time) data 
warehouse (RTDW) has been developed, e.g., (HH1E1- Strong demand for up 
to date data at any moment opens a new areas of research on assuring high 
performance of RTDWs. 
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Summary. The aim of this lecture is to present how popular user-centric tech- 
niques, namely personalisation and recommendation, can be adapted to an OLAP 
context. The presentation begins with an overview of query personalisation and 
query recommendation in relational databases. Then it introduces the approaches 
proposed for personalising OLAP queries with user preferences, and the approaches 
proposed for recommending OLAP queries. All the approaches are characterized in 
terms of formulation effort, prescriptiveness, proactiveness, expressiveness, and in 
terms of the data leveraged: the current state of the database, its history, or external 
information. 



Keywords: OLAP queries, preferences, query personalisation, collaborative 
filtering, recommender systems, query recommendation. 



3.1 Introduction 

According to a recent article of The Economist, mankind created 150 exabytes 
(billion gigabytes) of data in 2005. In 2010, it will create 1,200 exabytes 
p. With such amounts of data, it is of paramount importance to be able 
to access relevant information efficiently, using sophisticated search tools. 
On the other hand, systems offering search facilities, like DBMSs, are quite 
uneasy to use, driving querying or navigation into huge amount of data a 
very tedious process. Those such facilities should be more user-friendly j2|. 

In domains like the Web, Information Retrieval or e-commerce, user-centric 
approaches like personalisation or recommendation have been proved success- 
ful (see e.g., j^j). It is for instance believed that Amazon makes around 30% 
of sales thanks to recommendations. Such approaches are very relevant in a 
database context. For instance, the user may not accept to spend too much 
time conceiving the query. In addition, she may not be happy if the query’s 
answer shows too many or too few results. And, even if the size of the answer 
is acceptable, she may be relieved to see the system automatically suggesting 
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queries that will display other answers of interest, especially if she is left with 
the task of navigating the database to analyse the data it contains. 

A typical example of such an analysis is that of a data warehouse navigated 
by decision makers using OLAP queries 0 . A data warehouse can be seen as a 
large database with a particular topology, shared by many analysts who have 
various interests and viewpoints, where data is seen as a cube, explored by 
sequences of OLAP queries that may return large answers. In such a context, 
being able to personalize or recommend queries is seen as particularly relevant 

M- 

This lecture introduces how personalisation and recommendation ap- 
proaches have been adopted by the database community, and how they can 
be adapted to these particular databases that are data warehouses. More 
precisely, it will try to answer the following two questions: 

• Given a database query q , how to to cope with too many/too few results? 
Personalisation can be used to answer this question. If the query result is 
too large then being able to add preferences to this query gives a way of 
ranking the query results to focus on the most relevant first. On the other 
hand, if the query result is too small, then selection predicates, also called 
strong constraints, could be turned into preferences (or soft constraints) 
to weaken this query. 

• Given a sequence of queries over a database, how to suggest queries to 
pursue the session? In this case, what the user did in the past, or al- 
ternatively, what similar users did in the past, can serve as a basis for 
recommending relevant queries to complement the current query answer. 

The lecture is organised as follows. The next section presents basic def- 
initions and concepts of preferences and recommender systems. Section 1.4. .41 
delimits the scope of this lecture and precise how the approaches surveyed will 
be categorised. Section a, nd Section 14.51 introduce the existing approaches 

in relational databases and multidimensional databases, respectively. Section 
rm concludes the lecture with a brief discussion and presents some open 
issues. 

Note that this lecture assumes basic knowledge of relational database jjjjj 
and data warehousing 0. Part of the material that focuses on query rec- 
ommendation is borrowed from j5). The slides illustrating this lecture are 
available on the author’s web page |2j. 

3.2 Preliminaries 

In this section, we provide the basic definitions underlying personalisation 
and recommendation. 
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3.2.1 Preference Expression 

We begin by explaining the basics of preference modelling. A more compre- 
hensive introduction can be found in Ell- 

Qualitative and Quantitative Preferences 

Two types of approaches are used to express preferences. Qualitative ap- 
proaches express relative preferences i.e. , ”1 like a better than 6”. Such a 
preference is noted a > b where > is usually a Strict Partial Order (SPO). 
An SPO is a binary relation > over a set O which is 

• Irreflexive, i.e., for all a £ O, ->(a > a) 

• Asymmetric, i.e., for all a,b £ O, if (a ^ b) and (a > b) then -<(b > a) 

• Transitive, i.e., for all a,b,c £ O , if (a > b) and ( b > c) then (a > c) 

Given a preference relation >, the indifference relation ~ is defined by: 
(a ~ b) if -i(a > b) and —<(b > a). It expresses that a and b are not comparable. 
Particular partial orders of interest are Total Orders (TO) and Weak Orders 
(WO). A relation > is a TO if for every a and b, either (a > b) or (b > a). > 
is a WO if > is a SPO and ~ is transitive. 

Example 1. Consider the following database instance: 

Movies Author Genre Price Duration 



tl 


Cohen 


Comedy 


5 


90 


t2 


Cohen 


Comedy 


6 


100 


t3 


Cohen 


Comedy 


7 


80 


t4 


Allen 


Drama 


7 


120 


t5 


Lynch 


Drama 


5 


150 



The preference ” I prefer Lynch movies over Allen movies and Allen movies 
over Cohen movies” entails that tuple t5 is preferred to tuple t4 and tuple t4 
is preferred to tuples tl, t2, t3. As preferences are assumed to be transitive, 
we say that t5 dominates all the other tuples. Note that this preference says 
nothing e.g., for tl and t2, neither for tl and t3. 

Quantitative approaches express absolute preferences, i.e., I (do not) like a 
to a specific degree. They are based on Scoring / Utility Functions. I like a 
better than b is noted u(a) > u(b) where u is a scoring function. 

Quantitative approaches are often said to be less general than qualita- 
tive approaches in the sense that, in order to be representable using scoring 
functions, a preference relation has to be a WO, which implies that the cor- 
responding indifference relation has to be transitive. But on the other hand, 
only scoring functions can express the intensity of preferences. 

Example 2. The preference ”1 prefer Lynch movies over Allen movies and 
Allen movies over Cohen movies” can be expressed by the following scoring 
function (assuming that scores range from 0 to 1): 
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• ”1 like Lynch” corresponds to a score of 0.9 

• ”1 like Allen” corresponds to a score of 0.8 

• ”1 like Cohen” corresponds to a score of 0.5 

It can easily be seen that for instance, preference ”1 prefer cheaper movies, 
given that author and genre are the same” (i.e., tl is preferred to both t2 
and t3, but it is not preferred to t4 or t5) cannot be expressed with scoring 
functions. 



Preference Composition 

Preferences can be defined extensionally under the form relation instances, 
or intentionally. In the latter case, the intentional definition is called a model 
of preference. Models of preferences can be expressed with a given language 
like those of |HJ H2J and/or by using composition of preference relations. 
Compositions follow the approach used to express preferences and thus can 
be qualitative or quantitative. 



Qualitative Composition 

In what follows, let T be a set of tuples and >i and >2 be two prefer- 
ence relations over T. We restrict here to single dimensional composition, 
where preferences are expressed over a single relation. Common single dimen- 
sional composition includes Boolean Composition, Prioritized Composition, 
and Pareto Composition. 

Boolean composition involves a boolean operator, for instance: 

• Intersection, i.e., R = (>1 D >2) with (t R if) if (t >1 if) and (t >2 t') 

• Union, i.e., R = (>1 U >2) with (t R t') if (t >1 if) or (t > 2 if) 

Prioritized Composition imposes a priority of a preference over another. It 
is formally defined by R = (>1 < >2) with (t R if) if (t >1 if) or ( ->(f ' >1 t) 
and (t >2 t')) 

Pareto Composition assumes two preferences to be equally important. It 
is formally defined by: R = (>1 (g> >2) with (t R if) if ((t >1 if) and (t >2 if 
or t ~ 2 t')) or ((f > 2 t') and (t >1 t' or t ~i t')). 

Example 3. Consider the two preferences: ”1 prefer Lynch movies over Allen 
movies and Allen movies over Cohen movies”, called PI and ”1 also prefer 
shorter movies” called P 2. Composing them using intersection result in a 
particular SPO with no domination, that can be interpreted as everything 
is preferred since all tuples are undominated. Composing them with union 
violates irreflexivity and asymmetry, and thus the resulting relation is usu- 
ally considered as not being a preference relation. Indeed, in this case, the 
resulting relation would mean for instance that both t5 should be preferred 
to t4 (according to PI) and t4 should be preferred to t5 (according to P 2). 
Composing them with prioritisation results in a total order reflecting PI 
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first and then P 2 only when P 2 does not contradict PI. More precisely, we 
have t5 preferred to t4 preferred to t3 preferred to tl preferred to t2. Fi- 
nally, composing them using Pareto results in a preference relation where 
only t3 dominates tl and tl dominates t2, and neither t4 dominates t5 nor 
t5 dominates t4 since PI and P 2 do not agree for these two tuples. 

Note that properties preservation (irreflexivity, etc.) under various kind of 
composition operators has been deeply studied (see PH for more details). 

Quantitative Composition 

Quantitative composition is generally achieved through dedicated functions, 
like weighted functions, min, max, etc. An example of quantitative composi- 
tion is, given preferences PI modelled with score p\ and preference P 2 mod- 
elled with scorep 2 : Score /(pi,p 2 )(i*) = xx scorep\{ti) + (1 — x) x scorep 2 {ti ) 
where x is some weight. 

Example 4- Consider the following preferences: ”1 prefer Lynch movies over 
Allen movies and Allen movies over Cohen movies” (PI) expressed by: 

• ”1 like Lynch” with scorepi = 0.9 

• ”1 like Allen” with scorepi = 0.8 

• ”1 like Cohen” with scorepi = 0.5 

and ”1 also prefer shorter movies” (P2) expressed by: 

• ”1 like (duration=80)” with scorep 2 = 1 

• ”1 like (duration=90)” with scorep 2 = 0.9 

• ”1 like (duration=150)” with scorep 2 = 0.6 

Suppose we use the scoring function defined above to compose PI and P2, 
with x = 0.5. Then, for instance, the score of tuple tl would be: 0.5 x 0.5 + 
0.5 x 0.9 = 0.7 

3.2.2 Recommender Systems 

In this section, we briefly introduce the basics of recommender systems (see 
m for a more substantial presentation). 

Basic Formulation 

A recommender system is typically modelled as follows. Let I be a set of 
items (e.g., products in a typical e-commerce application) and U be a set of 
users (e.g., customers in a typical e-commerce application). Let / be an utility 
function with signature U x I — > R for some totally ordered set R (typically 
reals between 0 and 1). Recommending s 1 to u is to choose for the user u the 
item s 1 that maximizes the user’s utility, i.e., s' = argmaxj f(u, i). The func- 
tion / can be represented as a matrix M = U x /, that records for any user 
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u in U, any item i in /, the utility of i for u, that is f(u, i). The problem of 
recommending items to users is that this matrix is both very large and very 
sparse. Thus, many methods have been proposed for estimating the missing 
ratings. Moreover, achieving relevant user-specific recommendations is par- 
ticularly difficult since it has been observed that everyone is a bit eccentric 

Cl- 
in general, recommendation methods are categorized m into: (i) Content- 
based, that recommend items to the user u similar to previous items highly 
rated by u, (ii) Collaborative, that consider users similar (i.e. having similar 
profiles) to the one for which recommendations are to be computed as a basis 
for estimating its ratings and (iii) Hybrid, that combine content-based and 
collaborative ones. Note that d propose a multidimensional generalisation 
of this basic two-dimensional formulation, especially to support profiling and 
contextualisation . 



Content-Based Recommendation 

Typical content-based recommendation is based on the comparison between 
item profiles and user profiles. For instance, it can rely on the following 
principle: 

1. Build item profiles by using selected features and providing a score for 
each feature. 

2. Build user profiles from highly rated item profiles, typically by computing 
a weighted average of item profiles. 

3. Compare user profiles with non-rated item profiles to estimate the missing 
ratings. Typical similarity measures include vector-based similarities like 
cosine. 

4. Recommend to the user those non-rated items achieving the best simi- 
larity scores. 

Example 5. Consider the following matrix recording ratings: 



Donuts Duff Apple Tofu Water Bud Ribs 



Homer 


0.9 


0.8 






0.7 


Marge 






0.8 




0.6 


Bart 


0.7 


0.6 


0.1 




0.8 


Lisa 


0.2 






0.8 


0.6 


Maggie 


0.6 






0.5 


0.6 



Suppose that the features chosen for the profiles are: (contains sugar, ok for 
a diet) . Item profiles are modelled as vectors recording a score for these two 
features. Suppose here that the scores are automatically computed from the 
items’ nutrition facts. For instance, the profile for Donuts is (0.9,0) and the 
profile for Apple is (0.4, 0.6). Then user profiles are also modelled as vectors 
recording scores for the same two features, derived from the known ratings. 
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For instance, Homer profile would be: (0.9 x (0.9, 0) + 0.8 x (0.6, 0.1) + 0.7 x 
(0.6,0.1))/3 = (0.8, 0.1)0. Lisa profile would be: (0.3, 0.8). The similarity 
score for the user profile with the item profile estimates the missing ratings. 

The limitations of content-based approaches are the following: First finding 
a good set of features must be done very carefully since it impacts directly 
the score estimates and hence the quality of the recommendation. Another 
problem is that recommendations stick to the user profile. For instance, using 
the example above, Homer will never be recommended Tofu. Finally, this 
approach suffers from the cold-start problem, i.e., how to build a profile for 
a new user for who no ratings are known. 

Collaborative Recommendation 

The main idea of collaborative approaches is to benefit from all users’ ratings 
when estimating a user’s missing ratings. Depending on how the matrix is 
used (row- wise or column- wise), two techniques are distinguished, that are 
based on computing similarities among users or items: 

• User-user collaborative approaches estimate the ratings for items based 
on ratings of similar users. 

• Item-item collaborative approaches estimate the ratings for items based 
on ratings for similar items. 

Example 6. To illustrate the user-user approach, suppose that all user are 
modelled as vectors having as many components as there exists items, the 
value of the component being the rating for the item, or 0 if the rating is 
not known. For instance, Homer would be modelled as the following vector: 
(0.9, 0.8, 0,0, 0,0. 7,0). Similarity is computed between users, with cosine for 
instance, to find the users who are the most similar to the one for which 
the ratings are to be estimated. These users’ ratings are derived to estimate 
the user’s missing ratings. For instance, suppose that Bart is found the most 
similar to Homer, with a similarity score of 0.8. Then, given that Bart has a 
score for Ribs and Homer has not, Bart’s score is used to estimate Homer’s, 
by weighting Bart’s score with the similarity between Bart and Homer, i.e., 
0.8 x 0.8 is our example. 

The limitations of the collaborative approach are that it relies on heavy pre- 
computation, and that new users or new items, for which no ratings are 
known, cannot be taken into account. 

Hybrid Methods 

Hybrid approaches are used to cope with the limitations of both previous 
approaches. Such approaches include the aggregation of a content-based 

This profile can be interpreted as: Homer contains sugar and is not ok for a diet. 
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computed score with a collaborative computed score, the addition of content- 
based to collaborative filtering, or the use of item profiles to cope with the 
new item problem. 



3.3 Categorising the Approaches 

In this section we define precisely the scope of the lecture and present the 
criteria used to categorise the approaches. 

3.3.1 Scope of the Lecture 

There is a lot of works in the database community that deal with query 
transformation, i.e. , the process of, given a database query q 1 transforming 
q into another query q' . Among these work, we restrict our lecture to the 
following transformations: 

• Query personalisation: given a database query q and some profile, com- 
pute a query q' C q that has an added value w.r.t. the profile. 

• Query recommendation: given a database query q and some profile, com- 
pute a query q' such that neither q' C q nor q C q' , that has an added 
value w.r.t. the profile. Note that computing a query q' that include q 
would correspond to query relaxation (see e.g., (13 d). 

Note that other forms of query transformation like e.g., relaxation, non 
relational data types (XML, etc.), and implementation and evaluation issues 
will not be covered. 

3.3.2 Categorisation of the Approaches 

To describe and categorise the approaches presented in this lecture, we adopt 
the criteria introduced in PI, that are mostly used to differentiate person- 
alisation approaches. 

• Formulation effort: some approaches require the user to manually specify 
profile elements for each query, while in others the best they are inferred 
from the context and the user past actions. 

• Prescriptiveness: some approaches use profile elements as hard constraints 
that are added to a query while in other as soft ones: tuples that satisfy 
as much profile criteria as possible are returned even if no tuples satisfies 
all of them. 

• Proactiveness: distinguishes the approaches that suggest new queries 
based on the navigation log and on the context (but that does not exe- 
cute them), with respect to those that change the current query or post 
process its results before returning them to the user. 

• Expressiveness: personalization criteria have different expressivenesses 
and can be differently combined. 
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To precisely distinguish between the type of data needed, especially for rec- 
ommendation techniques, we also use the taxonomy proposed in There, 

three categories are identified: 

• Current-state approaches, exploiting the content and schema of the cur- 
rent query result and database instance. Current-state approaches can be 
based either on (i) the local analysis of the properties of the result of the 
posed query or (ii) the global analysis of the properties of the database. 
In both cases systems exploit (i) the content and/or (ii) the schema of 
the query result or the database. 

• History-based approaches, using the query logs. 

• External sources approaches, i.e., approaches exploiting resources external 
to the database. 

3.4 Query Personalisation and Recommendation in 
Relational Databases 

In this section, we give a brief overview of the approaches developed in the 

relational database context. 

3.4.1 Personalisation in Databases 

We can distinguish two types of approaches: 

• The use of explicit preference operators in queries. The most representa- 
tive works in this category include Winnow H3, Preference SQL m and 
Skyline J2J1 This type of approaches requires high formulation effort, is 
not prescriptive not proactive, but is highly expressive. 

• The rewriting (expansion) of regular database queries based on a profile. 
The most representative work is initiated in EH- This approach requires 
a low formulation effort, is prescriptive and not proactive and has low 
expressiveness. 



Use of Dedicated Operators 

The basic definition of the operator computing dominating tuples is the fol- 
lowing. Given a relation r with schema sch{r) and a preference C over sch(r) 
defining a preference relation >c, the Winnow operator m, denoted wc, is 
defined by: wc(r ) = {i € r|(/lf' £ r)(t' >c t)}. 

This operator can be used to order the query answer. Indeed, the answer 
to a query q can be partitioned according to C, i.e., q = wc(q) U wc(q — 
lOciq)) U . . . meaning that the answer can be presented by displaying wc(q) 
first, then wc(q — wc(q)), etc. 
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Example 7. Suppose that preference C is ”1 prefer drama”. The query 
”What are my most preferred affordable movies?” can be expressed by: 
wc(vpri ce< 7 (Movies)). The answer can be presented by displaying t5 first, 
and then tl and t2. 

The work of Kiessling EH extends this idea of having an operator dedicated 
to preference expression, to enrich the SQL syntax with a PREFERRING 
clause that enables the use of specific preference constructors. Each preference 
constructor can be use to express a specific atomic preference. Preferences 
can be composed with Pareto or prioritisation. 

Example 8. Consider the following queries expressed with Preference SQL: 

1. SELECT * FROM Movies PREFERRING HIGHEST(Duration) 

2. SELECT * FROM Movies PREFERRING Genre IN ( 'Drama', 'Thriller' ) 

The model of preference specified by the first query is the following: for 
some duration x and y, ( x > highest y) if value x is greater than value 
y, meaning that movies with highest durations will be preferred. For the 
second query, the model of preference says that Drama and Thriller movies 
will be preferred over any other genre. More formally, if x and y are two 
movie genres, ( x >in(‘ D rama', ‘Thriller') y) if £ £ {‘Drama', ‘Thriller'} and 
y ^ {‘ Drama' , ‘Thriller’} . 

A restricted form of Kiessling’s SQL extension is the addition of the Skyline 
operator to SQL m- This is a restriction in the sense that, originally, Sky- 
line queries feature only numerical attributes and Pareto composition. This 
operator is defined as follows. The syntax of a skyline clause is: 

SKYLINE OF <h MIN, . . . , d k MIN 

4+i MAX, ..., d, MAX 
di + 1 DIFF, . . . , d m DIFF 

The semantics of such a clause is that a tuple p = (pi, ... ,p n ) dominates 
a tuple q= (qi,...,q n ) if: 

Pi < qi , for i = 1, ... ,k 
Pi > qi, for i = k + 1 , . . . , l 
Pi = qi , for i = l + 1 , . . . , m 

Query Expansion 

In the absence of a dedicated preference operator, a regular user query can 
be processed and transformed with preferences, resulting in another regular 
query that is typically a subquery of the initial one. We use the work of EH 
to illustrate this approach. 

In this work, preferences come from a user profile which is modelled as 
a graph of atomic quantitative preferences of the form (selection condition, 
score), where selection is a regular selection predicate (that may be used to 
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join two tables) and score is a real between 0 and 1 indicating the inten- 
sity of the preference. Atomic preferences are composed using a very simple 
principle: composition of preference (si, id) with preference (s2,v2) results 
in preference (si A s2,id x v2). 

Query expansion is performed as follows. First, given a query, the k most 
relevant preferences are selected from the profile. Then the selected prefer- 
ences are added as hard constraints to the query, and the query is finally 
executed. 

Example 9. Consider the following user query: SELECT title FROM Movies 
WHERE duration < 120. Suppose that the best preference selected from the 
profile of the user who wrote the query is ” I like Lynch as Author” . Then the 
query is modified, resulting in: SELECT title FROM Movies WHERE duration < 
120 AND Author=’Lynch’. Note that in this example, if the query is evaluated 
over the instance given in Example E then the result is be empty. 

This work has been extended to take into account constraints like result 
cardinality or execution time m- 

3.4.2 Query Recommendation in Databases 

Recently, to our knowledge, there has been only two attempts to formal- 
ize database query recommendations for exploration purpose |!'2H' F2I- It is 
important to see that given the context of database exploration, a direct 
transposition of the users x items matrix is not relevant. Following m , we 
use the categories introduced Section E3 to categorise these approaches. 



Current State 

In [I]J|, the authors focus on the current state approach and propose two 
techniques to recommend queries based on the database instance and/or 
the current query answer. The first technique, called local analysis, analyse 
the answer to the user query to discover patterns and use these patterns 
to recommend. The second technique, called global analysis, extends this 
principle to the entire database instance. The instance would have to be 
analysed off-line, for instance to discover correlations among attribute values. 

Example 10. Consider the current query: SELECT Author, Genre FROM Movies 
WHERE Duration > 100. Suppose that by analysing this query answer, it is 
found that the result has a lot of tuples whose genre is Drama. Then, a possi- 
ble recommendation would be: SELECT Author, Genre FROM Movies WHERE 
Genre='Drama’. Suppose now that a global analysis of the database instance 
shows that value ‘Cohen’ for Author is correlated with value ‘Comedy’ for 
Genre. Then, if the current query is: SELECT * FROM Movies WHERE Au- 
thor=’Cohen' a recommendation would be: SELECT * FROM Movies WHERE 
Genre='Comedy'. 
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History Based 

For [ 23 !, the problem of query recommendation is viewed as a sessions x 
tuples matrix. With this approach, a query is represented as a vector whose 
arity is the number of tuples of the database instance. The basic approach 
considers that each component of such a query is either a 1, if the query used 
the tuple, or a 0 otherwise. A session is also represented by a binary vector 
which is the logical or of the vectors of the queries of the session. Consider 
a particular session S c called the current session. Recommendations for S c 
are computed as follows. First, sessions similar to S c are found, using some 
vector similarity measures like e.g., cosine. For those session closest to S c , 
the queries they contain are also compared to S c using the same similarity 
measure. Finally, the queries of those sessions that are the most similar to S c 
are recommended. 

In subsequent works m . the authors focus on fragments (attributes, ta- 
bles, joins and predicates) of queries and consider thus a sessions x query 
fragments matrix. In this work, the matrix is used column-wise in the sense 
that recommendation computation relies on fragment similarity instead of 
session similarity. 

Query Completion 

We conclude this section by noting that recommendation also make sense 
to assist the user writing a query. For instance, the SnipSuggest approach 
m is a collaborative approach that uses a query log to provide on-the-fly 
assistance to users writing complex queries. The query log is analysed to 
construct a graph whose nodes are the fragments appearing in queries and 
edges indicate the precedence. The edges are labelled with the probability 
that a fragment follows another fragment in the logged queries. Given the 
beginning of a current query, the graph is used to complete the query with 
the fragment that is the most likely to appear. 

Example 11. Suppose that the query log contains only the following two 
queries: SELECT Title, Genre FROM Movies WHERE Actor='C. Lee' and SE- 
LECT Title FROM Movies WHERE Author='Allen'. Suppose that a user starts 
writing a query with only SELECT. It can be then suggested the attribute 
Title since this attribute is the most likely to appear according to the query 
log. 

3.5 OLAP Query Personalisation and 

Recommendation in Data Warehouses 

In this section, we introduce the personalisation and recommendation ap- 
proaches that are tailored to data warehouses queries. We start by reviewing 
the salient peculiarities of data warehouses compared to classical relational 
databases. 
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3.5.1 Peculiarities of Data Warehouses 

As evidenced by e.g., E3E3I2EI basic peculiarities of typical data warehouse 
can be summarized by: 

1. A data warehouse is a read-mostly database and its instance has an in- 
flationist evolution (data are added, never or very seldom deleted). It is 
for instance likely that a user issues periodically the same sequence of 
queries more than once. 

2. A data warehouse is a database shared by multiple users whose interests 
may vary over time. It is argued in mmmu that user preferences 
are of particular importance in data warehouse exploration. It would for 
instance be important to issue recommendations computed from other 
users’ habits (e.g., in a collaborative filtering fashion) and at the same 
time respecting the user interests. 

3. A data warehouse has a particular schema that reflects a known topol- 
ogy, often called the lattice of cuboids, which is systematically used for 
navigation EH- Rollup and drilldown operations that allow to see facts 
at various levels of detail are very popular in this context. 

4. A typical analysis session over a data warehouse is a sequence of queries 
having an analytical goal, each one written based on the past results of 
the session. They may be expressed in a dedicated query language (like 
e.g., MDX (221), ma y produce large results that are usually visualised as 
crosstabs. Moreover, the session has a sense w.r.t. some expectations. For 
instance, the user may assume a uniform distribution of the data OEI 
or that two populations follow the same distribution m Sessions (as 
sequences of queries) are of particular importance in this context since 
by this sequence the user navigates to discover valuable insights w.r.t. 
her expectations or assumptions. 

We now describe how these peculiarities have been taken into account 
when personalising or recommending OLAP queries. 

3.5.2 Personalisation of OLAP Queries 

To the best of our knowledge, only two works deal explicitly with the per- 
sonalisation of OLAP queries. The first work m borrows from the query 
expansion paradigm, where a query expressed in MDX is transformed into 
another MDX query by using elements of the user profile. This approach fea- 
tures the same characteristics as that of m in terms of formulation effort, 
prescriptiveness, proactiveness and expressiveness. The second work (SHE! 
is inspired by the use of explicit preference constructors enabling to express 
complex preferences directly within the query. This approach features the 
same characteristics as that of GH in terms of formulation effort, prescrip- 
tiveness, proactiveness and expressiveness. 
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Query Expansion 

The work of m is inspired by that of the main difference being that it 
does not use scoring function to represent preferences but relies on a qualita- 
tive model instead. It proposes to expand a current MDX query by another 
MDX query q using those elements of the profile that guarantee that (i) q is 
included (in the classical sense of query inclusion) in the current query, (ii) q 
only fetches preferred facts w.r.t. the profile and (iii) q respects a visualisation 
constraint. In this work, the user profile is given by a qualitative preference 
model relying on orders defined over dimension names and over attribute 
values. The visualisation constraint is used to indicate the maximum number 
of references (i.e. , positions extracted from a data cube) that can be used for 
displaying the query answer. Note that this work assumes that the instances 
of the dimension tables can be used to compute the most preferred references 
in terms of the user profile. 

Example 12. Consider the following MDX query: 

SELECT CROSSJOIN({City.Tours, City.Orleans},Category.Members) 

ON ROWS 

{2003, 2004, 2005, 2006} ON COLUMNS 

FROM SalesCube 

WHERE (Measures. quantity) 

This query asks for 2 cities, 5 members of the Category level and 4 years, 
i.e., 40 cells of the cube. Suppose that the device for displaying the query 
result imposes the use of a cross tab with 2 axes with only 4 positions, i.e., 
only 16 cells. Suppose also that the user profile states that the most recent 
years are preferred and that, for the product dimension, the preference are: 
Electronics < shoes < cloth < food < drink. 

The personalisation process is as follows. First the most preferred refer- 
ences are computed. In this example, that would be references ( Orleans, 2006, 
drink, quantity) and (Tours, 2006, drink, quantity). Then it is checked if this 
set of references complies with the visualisation constraint. In this example, 
as there are 2 preferred references for 16 available positions, this is indeed 
the case. When it is the case, then the second preferred references are added 
to the first preferred and again the whole set is checked against the visual- 
isation constraint. The process stops when no more positions are available 
in the cross tab used to display the result, or all the references requested by 
the query have been included in the cross tab. Then the set of references is 
used to construct the personalised query, which, in this example, would be: 
SELECT CROSSJOIN(City.Tours, City.Orleans.Category.Food, Category.drink) 
ON ROWS 

2003, 2004, 2005, 2006 ON COLUMNS 
FROM SalesCube 
WHERE (Measures. quantity) 
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Use of Dedicated Operators 

The work of 1231332 , is inspired by that of m where preferences are written 
for each query using a dedicated PREFERRING clause added to the MDX 
language. In this work, preference constructors are tailored to the multidi- 
mensional context. For expressing atomic preferences, the main differences 
with m are thus the following: 

• The semantics of preferences over attribute values is hierarchy-driven. 

• Preferences can be expressed over levels and thus over cuboids. 

• Preferences can be expressed over measures. 

As in m, atomic preferences can be composed using Prioritization or 
Pareto. A specific implementation has been developed for evaluating prefer- 
ence queries expressed in this language. 

Example 13. We illustrate two representative preference constructors. Sup- 
pose that one of the hierarchies of the 5-dimensional schema is named RES- 
IDENCE and has levels City, State, Country. The preference PREFERRING 
City IN ’LA’ not only means that City ’LA’ is preferred over all other cities, 
but also that ancestors and descendants of ’LA’ are preferred over val- 
ues of this hierarchy that are neither ancestors nor descendants of ’LA’. 
For instance, the reference (LA, all, 2010, F, all) will be preferred over, 
say, (NY, all, all, all, all), and the reference (California, all, 2009, all, all) will be 
preferred over e.g., (NY, all, 2010, all, all). Note that (LA, all, 2010, F, all) 
~ (California, all, 2009, all, all) 

As another example, PREFERRING RESIDENCE CONTAIN City means 
that the facts grouping by level City are preferred over the facts group- 
ing by another level of the RESIDENCE hierarchy. In that case, reference 
(LA, all, 2010, F, all) is preferred over (California, all, 2009, all, all). 

3.5.3 Recommending OLAP Queries 

Surprisingly, although there are many works in relational database dealing 
with query personalisation and quite a few works around query recommen- 
dations, in data warehouse, that seems to be the other way round. Indeed, to 
the best of our knowledge, the existing approaches for recommending OLAP 
queries are pr?n [T7i rr^i rrm m rrm rrm rm rm rrm m rrrj Note that the 
older works do not make use of the term query recommendation. 

All these approaches are proactive, prescriptive, require a low formulation 
effort and have low expressiveness. To categorise them, we use and refine 
the categories proposed by m- We distinguish between (1) current-state 
methods exploiting external information (more precisely a user profile), (2) 
current-state methods based on expectations, (3) history-based methods ex- 
ploiting query logs, and (4) hybrid (history and current-state) methods based 
on expectations. 
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Methods Exploiting a Profile 

The works in this category [35, i]Z suppose that a profile is provided together 
with the current query. The profile expresses user preferences over the tuples 
of the fact table using a quantitative approach. As in classical query expan- 
sion, the profile is used to modify the current query. The main difference with 
m is that the expanded query is not necessarily a subquery of the initial 
query. 



Methods Based on Expectations 

The works in this category mmmmm rely on discovery driven analysis, 
where a model on unseen data is used together with the already seen data, i.e., 
the results of the already launched queries of a given session. The strongest 
deviations to the model are recommended. 

We briefly recall the concept of discovery driven analysis. To support in- 
teractive analysis of multidimensional data, m introduced discovery driven 
analysis of OLAP cubes as the definition of advanced OLAP operators to 
guide the user towards interesting regions of the cube, lightening the burden 
of a tedious navigation. These operators are of two kinds. The first kind tries 
to explain an unexpected significant difference observed in a query result by 
either looking for more detailed data contributing to the difference Eg, or 
looking for less detailed data that confirm an observed tendency The 
second kind proposes to the user unexpected data in the cube w.r.t. the data 
she has already observed, by adapting the Maximum Entropy Principle j3j. 

Recommendation methods based on discovery driven analysis consist in 
recommending queries that result in data deviating the most from a consen- 
sual model (that we call expectation). Among the works in this approach, 
the main difference is the model used, i.e., the nature of the expectation. 
|23 [23 ED rely on the assumption of a uniform data distribution. m assume 
a statistical independence of the cube’s dimensions. 

Example 14 ■ We briefly illustrate the work of 0|. Suppose that a user asked 
for the total sale of Cheese in Europe for Quarter 1 of Year 2010. The result 
indicates that this sale is 100. Suppose that it is known (because the OLAP 
server transparently evaluates queries related to what the user is doing) that 
at the country level, this 100 is perfectly distributed among the 4 countries 
Europe drills down to (i.e., the sales of Cheese in these countries are 25 
for each of France, Italy, Spain and U.K.). If such an uniform assumption 
is usually believed to be true, then this drill down does not correspond to a 
relevant recommendation. On the other hand, if at the month level, it is found 
that the sales of cheese in Europe is quite different from one month to another, 
then drilling down to the month level would be a good recommendation. 

Note that [3H| and |22| gjjj compute suggestions using only the current query 
while considers every former query result. 
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Methods Exploiting Query Logs 

The works in this category ga eu sa S3 suppose that a query log is used 
to look for similarities between the current session and former sessions, to 
extract one past query as the recommendation. 

The methods in this category mainly differ by the way they consider the sim- 
ilarity between sessions and/or queries, gl. use the classical Levenshtein (for 
session) and Hausdorff (for queries) distances. >42, .13) group queries by com- 
mon projections and selections, and use a Markov model to represent sessions. 
pTTTj cluster queries using the Hausdorff distance and detects if the current ses- 
sion is a prefix of some existing session. piHSl and m identify a matching 
position for the current session in the closest former session and recommend 
the query after this position. m recommend the last query of the session that 
is the closest to the current one. The score that estimates the interest of a query 
for the session is computed based on the proximity of the current session with 
the log queries SU or based on the probability to have the recommended query 
following the current query mm- 



Hybrid Methods 

The only work in this category is fTTl In this work, that assumes a fix 
data warehouse instance, the query log is processed to detect discovery driven 
analysis sessions. Sessions are associated with a goal, and recommendations 
are those queries of former sessions having the same goal as that of the 
current session. More precisely, the model of expectation is the one proposed 
in ESI EH E|- The log is processed to discover pairs of facts that show a 
significant difference (like e.g., a drop of sales from one year to the following 
year). Those pairs are then arranged into a specialisation relation based on 
the hierarchies. At query time, if the current query investigates a pair that 
relates to the pairs discovered in the log, then the past queries featuring such 
pairs are recommended. The main difference with the techniques of [33 EH El 
is that only the log is searched for interesting deviations. 

Example 15. Suppose that the current query result shows that there is a big 
drop of sales in the sale of Cheese in Europe from year 2009 to year 2010. 
Suppose that in the log, it is found that past queries have already investigated 
a big drop of Cheese sales in France for the same couple of years. Then such 
queries will be recommended. 

3.6 Conclusion 

This lecture introduced OLAP query personalisation and recommendation, 
by first defining the basic concepts and then providing an overview of the ex- 
isting approaches in relational and multidimensional databases. We restricted 
the scope of this presentation to the problem of, given a database query q, 
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compute a query q' C q (personalisation) or compute a query q' such that nei- 
ther q' C q nor q C q' (recommendation). We also categorise the approaches 
in terms of formulation effort, prescriptiveness, proactiveness, expressiveness, 
and the type of data needed: current state, history based, external data. With 
respect to this categorisation, it can be noted that many interesting combi- 
nations could investigated. For instance, to the best of our knowledge, there 
exists no approach that requires a low formulation effort, being proactive and 
not prescriptive while remaining highly expressive. 

One of the main limitations in this field of research is that assessing effec- 
tiveness of the approach is very difficult since it should be based not only on 
real data but also on users’ feedback, both being very difficult to obtain due to 
contexts the typically involves sensitive data. This is for instance illustrated 
by the fact that, in a data warehouse context, there exists no categorisation 
of the users’ navigational behaviours, whereas such a categorisation exists in 
the web 10]. 

The domain of query personalisation and recommendation is still in its 
infancy. A lot of open issues can be listed, ranging from user privacy to the 
quality of recommendations and personalisations, most of these issues being 
also relevant beyond the data warehouse context. For instance, the learning 
and revision of preferences or navigational habits is one of these challenges. 
It is of paramount importance for computing more accurate and reliable 
personalisations and recommendations. A preliminary work tailored to the 
data warehouse context m is a first step in that direction. 
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Summary. We discuss the novel problem of supporting analytical business in- 
telligence queries over web-based textual content, e.g., Bi-style reports based on 
100.000s of documents from an ad-hoc web search result. Neither conventional 
search engines nor conventional Business Intelligence and ETL tools address this 
problem, which lies at the intersection of their capabilities. Three recent devel- 
opments have the potential to become key components of such an ad-hoc analysis 
platform: significant improvements in cloud computing query languages, advances in 
self-supervised keyword generation techniques and powerful fact extraction frame- 
works. We will give an informative and practical look at the underlying research 
challenges in supporting ” Web-Scale Business Analytics” applications that we met 
when building GoOLAP, a system that already enjoys a broad user base and over 
6 million objects and facts. 



4.1 Introduction 

Which companies collaborate with Boeing? Are these organizations also col- 
laborating with Airbus or Fokker? Do employees of these companies have a 
criminal record? Who published the information? 

Each day new pages emerge in the Web that may contain a textual rep- 
resentation of facts for answering these questions. Strategic decision makers 
may frequently research the Web for questions like these. Often answers to 
these queries might not be published in textual or structured form by Web ‘in- 
formation aggregators’ like Wikipedia.com , Freebase.com , Trueknowledge.com 
or Yago [TJ. Rather, this rare factual information is hidden in unstructured 
Web text on a few Web pages of news agencies or blogs. Unfortunately, col- 
lecting factual answers from these pages with a general Web search engine is 
still a dreaded process for a user. 

One option to populate a fact base is to crawl a large document collection. 
For instance Google Squared pjj populates its data base with facts from the 
large corpus of the general Web. The system extracts these facts from tables, 
lists and from text with open information techniques. However, in 0 we ob- 
served that only a small fraction of a large archive de facto contains factual 
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information. Hence, strategies that might execute a full scan over the entire 
archive can drastically waste processing and storage resources (see also 0). 
Another option is discovering facts in retrieved pages from ad-hoc keyword 
search. Unfortunately, this is still a tedious task, since Web search engines 
do not return aggregated factual information. The heuristic of a search en- 
gine user is: type in keywords as queries, ‘extract’ relevant facts from the 
top-k documents, filter out relevant facts and compile facts into a structured 
fact table. Therefore the user typically repeats this process multiple times in 
order to complement missing attribute values and to enhance the chance to 
discover unseen facts. 

Significance of our approach. We present GoOLAP, a system that aims 
to automate the fact retrieval process from Web search engines. GoOLAP 
has three significant aspects: (1) It provides powerful operators for analytical 
Web research on textual information, such as augmenting facts for an object, 
tracing back the textual origin of a fact or comparing factual information for 
a list of objects. (2) As a natural alternative to crawling a large proportion of 
the Web, GoOLAP interprets user interactions as input to identify missing 
facts. These user interactions trigger a fact retrieval process with the goal to 
populate the GoOLAP fact base from Web-scale indices of existing search 
engines in an ad-hoc fashion. This process is powered by the FactCrawl en- 
gine that leverages sophisticated keyword generation techniques H (SI and 
page classification techniques to retrieve only pages that likely contain 
missing and rare factual information 0 (3) GoOLAP combines these ap- 
proaches to drastically avoid crawling, indexing and extracting potentially 
billions of irrelevant pages. GoOLAP’s human-machine machine generated 
fact base nearly reaches 6 million facts and objects; a dimension that is still 
an order of magnitude smaller than community- generated fact bases, such as 
DBpedia [Bj or CrunchBase [9;,. However, our prototype extracts facts in a 
hybrid approach: It leverages the crowd, the power of GoOLAP users, to dis- 
cover new objects in queries, extracts and augments with the help of parallel 
running machines new factual information and again utilizes the power of 
GoOLAP users to verify and re-trigger this process, for instance in the case 
of incomplete information. 



4.2 Related Work 

We discuss relevant related work in the areas of focused fact retrieval from 
full text indexes, Open Information Extraction and fact search engines. 

Keyword query generation with QXtract. The authors of QXtract m 
pioneered work on automatically generating keyword phrases for fact retrieval 
from full-text indices. The system executes the following stages in a one 
time learning process: sample seed documents, observe/score phrases in these 
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documents, query a search engine with the most promising phrases and for- 
ward the retrieved pages to an information extractor. 

Sample seed documents. QXtract requires a small set of manually spec- 
ified ’seed’ facts to retrieve an initial set of sample pages for training. The 
size of this sample influences the precision of the learning process; the authors 
suggest between 100 and 2500 pages. After sampling, the pages are forwarded 
to the extractor to identify which documents are relevant for the extraction 
task. 

Observe and score phrases. QXtract utilizes three classification ap- 
proaches for observing and scoring unigram terms from the documents in 
the seed sample: An SVM-based classifier, the OKAPI system IEP and the 
rule-based classifier Ripper are trained on the set of seed documents. 
Each approach returns a list of terms, ordered by their significance for clas- 
sifying relevant and irrelevant documents. The authors of QXtract propose a 
threshold based technique to assemble and select relevant phrases as keyword 
queries out of these terms. 

Determine document retrieval order. Next, QXtract retrieves a ranked 
set of documents from the index for each generated keyword query, where the 
queries are selected from the three lists in a round robin fashion. All docu- 
ments retrieved for a phrase are forwarded to the fact extractor. The process 
continues with selecting the next highest ranked phrases from each list and 
forwarding the resulting documents to the extraction service. It terminates 
once no more phrases exist in any list. The authors evaluated their approach 
on the two fact extractors Company Headquarter and DiseaseOutbreak and 
achieved a recall of about 20% while processing 10% of the documents in the 
test corpus. In jj: the authors describe a framework incorporating the idea 
of automatic keyword generation as an execution strategy. 

Phrase generation from SQL queries. Authors of m extract and rank 
potential keywords from attribute names and values of a structured query. 
They apply a greedy approach that selects terms based on their relevance 
measures informativeness and representativeness. Our approach makes use 
of a similar idea: We trust that Web page authors develop a set of common 
words and grammatical structures to express important factual information 
in natural language. 

Self-supervised keyword generation. In |5. we published an approach 
that applies a self-supervised keyword generation method to extract mean- 
ingful phrases which often correlate with factual information in sentences. 
Our approach utilizes open information extraction techniques to generalize 
fact-type-independent lexico-syntactic patterns from sentences m Recently, 
we presented an extended version of this work on fact retrieval 0, called 
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FactCrawl. Our framework allows for the integration and evaluation of dif- 
ferent feature generation methods. 

Fact extraction. Multiple projects, such as CIMPLE |J3j, AVATAR \ 1 tij . 
DIAL HU, DoCQS [EU, PurpleSox |13| describe an information extraction 
plan with an abstract, predicate-based rule language. Our approach tries to 
’re-engineer’ common patterns from observing fact extractors and other pat- 
terns that appear frequently with factual information on a page as features. 

Fact search. Recently, the commercial search engine Google Squared (2 
populated a fact base with machine extracted information from Web tables 
and Web lists. That prototype has access to any page in the Google index 
that contains a table or a list. Contrary, GoOLAP can neither rely on nor 
compete with this massive amount of data. Rather, we utilize user triggered 
interactions to generate keyword queries only for interesting and frequently 
requested facts. These keyword queries leverage the index of a Web search 
engine and retrieve only pages that likely contain facts from which we aggres- 
sively filter our irrelevant pages. With both techniques, generated keywords 
and aggressive Web text filters, we leverage the crawling power of a Web 
search engine, but can drastically limit our fact extraction costs and avoid 
crawling and indexing potentially billions of irrelevant pages. Google Squared 
focuses on facts that are either represented as Web lists or as Web tables 0 . 
Thereby Google Squared may miss important facts in the much more com- 
mon textual representation. Our extraction approach is complementary since 
we focus on natural language text only. 

Open information extraction and fact search. The Open Information 
Extraction approach m of Google Squared moves the semantic interpreta- 
tion of fact types to a human user. Contrary, GoOLAP relies on a predefined 
set of relation extractors. These extractors return well defined semantics for 
domain specific facts. Thus, this approach allows the aggregation and in- 
terpretation of extracted facts by a machine, such as by an OLAP query 
processor. In addition, these facts can be integrated with user contributed 
fact bases, such as the project DBpedia 0. 

Combining OLAP cubes and Web Documents. The idea of combining 
OLAP cubes and web documents is not new. In fact, many digital libraries 
like ACM, DBLP and Microsoft Research compute aggregations on document 
meta data. Authors of m , m and m published principles for integrating 
OLAP Cubes and Web documents. Our work utilizes these approaches and 
goes one step beyond: Our design allows users to trigger a focused fact re- 
trieval process that allows the user to create domain specific specific OLAP 
cubes, to identify missing facts and to restart this process until the user be- 
lieves that sufficient facts exist. Thereby the system executes and controls 
the the retrieval of missing facts while the user exploits OLAP cubes. 
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The relation to Linked Open Data. Much factual information is read- 
ily available as LOD (in RDF) j§j which can probably significantly increase 
the quality of the results of the GoOLAP system described. Alternatively, 
GoOLAP can also contribute newly extracted facts to the LOD cloud. 

4.3 Exploring GoOLAP’s Fact Base 

This section describes how a user may explore the GoOLAP fact base; see also 
E3B3Q for a discussion of the research challenges and 123 for an overview 
of exploratory search. In this process the system retrieves documents , such 
as HTML pages, issues keyword queries against a Web search engine, such 
as sequences of lexical tokens, and stores, documents, facts, references to 
documents and the retrieval date in a RDBMS that follows the database 
layout of m ( see also next Section). 

In the rest of this section we introduce five operations that enable a user to 
interact with a system. These operators reflect the typical process of an ex- 
ploratory search interpret the query, retrieve augmented facts for the query, 
expand a single fact into a list of related facts, and trace back the origin if a 
single fact: 

Interpret. A query typed into a search field can express different inten- 
tions. For example, a user has the intention Augment facts for Boeing , so he 
enters “Boeing” into the search field. While he is typing the query, the sys- 
tem returns multiple interpretations in an auto complete dropdown, such as 
company Boeing , person Boeing or product Boeing Ifl. The user may select 
an interpretation for Boeing (e.g. company Boeing ) to send the interpreted 
query to the system. 

Augment. This operation collects and augments information for a particu- 
lar object. In addition, the system returns images of the selected object and 
shows links to objects of the same type, such as other companies. Finally, 
the system returns a list of ranked facts. This process has two steps. (1) Fact 
type enumeration. The system takes as input the selected object. Each object 
corresponds to a specific type for which the system looks up known fact types 
in a system- wide ontology. Currently this ontology holds over 70 different fact 
types. For instance, for an object of the type company , the system returns 
fact types such as layoffs , competitors or products. (2) Return most interesting 
facts for each type. For each fact type the system queries the corresponding 
table for non-empty facts. Next, it ranks facts for each type by the notion 
of interestingness m- For the exact process of object reconciliation we rely 
on existing approaches, such as 123 and j2B| ■ Note, that this functionality is 
still major challenge when integrating data coming from different (probably 
inconsistent) data sources. 
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Expand. This operation enables a comparative analysis across facts for mul- 
tiple objects. Consider a user who wants to compare different aircraft vendors, 
such as Boeing , Fokker or Northrup Grumman. For each object in this user 
defined selection our system projects a set of facts per type and creates a tab- 
ular fact sheet. Objects are displayed as rows, fact types as columns. Each 
cell lists facts ranked by interestingness. Note that this table represents a 
non- first-normal- form of the underlying fact base; therefore multiple values 
per cell may exist. 

Trace back. For each fact, the user can access the original document at any 
point of the assessment process in order to continue his research. 

Subscribe. Each user has the possibility to register a profile on the GoOLAP 
site. This enables a registered user to give feedback to the results by clicking 
on agree or disagree buttons or editing factual data in-place. If a user is 
interested in a particular object, e.g. his own company, he can subscribe to 
the object by clicking on the star next to the object’s name. The system then 
tries to fetch more data for the subscribed object. 

4.4 User- Triggered Fact Retrieval from an Inverted 
Index 

The main goal of our framework is to be able to mine all the facts contained 
in a document collection while processing as few documents as possible. To 
solve this task we adapt keyword based document retrieval systems, such as 
Web search engines, to retrieve only those pages that contain a fact (a factual 
statement expressed in unstructured text which an information extractor can 
detect and extract) . Our fact retrieval method emulates a search engine users 
behavior to solve the ’’inverse retrieval” problem of identifying discriminative 
keywords for the retrieval of relevant documents. Figure 14. II introduces our 
general fact retrieval process: 

Step 1: Interpret and execute user query. The system collects user in- 
teractions, determines the intent of the user and executes a query against 
the local fact base. While a crawl-based fact retrieval execution guarantees 
that all documents in an archive are processed, an indexed-based execution 
might miss some relevant documents. For compensating this disadvantage, 
GoOLAP observes the completeness and rarity |Z| of returned results to iden- 
tify potentially missing facts. If necessary, GoOLAP schedules a fact retrieval 
activity (see Table mi) . For example, if a user requests facts for an object 
that is not yet contained in the GoOLAP base, the system will trigger the 
retrieval of pages that contain facts for the missing object. 

Step 2: Generate keyword query to retrieve missing facts. GoOLAP 
forwards the corresponding textual object representation and the semantic 
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Fig. 4.1. User triggered fact retrieval from Web search engines. The fact retrieval 
process is interactively triggered by user interaction, here INTERPRET and AUG- 
MENT. 



type of the missing fact to FactCrawl j^j , a framework that emulates a search 
engine user’s behavior to solve an inverse retrieval problem. This frame- 
work queries a Web search engine with automatically generated keywords, 
re-ranks the resulting list of URLs according to a novel fact score and forwards 
only promising documents to a fact extractor. FactCrawl generates keywords 
using structural, syntactic, lexical and semantic information from sample 
documents. Thereby FactCrawl estimates the fact score of a document by 
combining the observations of keywords in the document. 

Step 3: Filter out irrelevant pages. GoOLAP utilizes a fact predictor |||; 
this technique reliably filters out irrelevant pages and only forwards relevant 
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Table 4.1. GoOLAP collects and interprets more than 30 user interactions that 
may trigger a fact retrieval process. This table displays user interactions and system 
activities contained in our demonstration scenario. 



User interaction 


Intention: The user... 


System activity: The system... 


Retrieval activity: FactCrawl... 


Click a link that points to 
GoOLAP 


discovers a GoOLAP result page for 
C ompany.Boeing in another search engine and 
clicks on the link 


displays the AUGMENT result page for 
Company.Boeing 


retrieves more facts for 
Company.Boeing, such as 
headquarter, layoffs etc. 


Search with keyword 


wants to augment facts for the object 
Company.Boeing and types Boeing into the 
search field 


interprets the input and displays an autocomplete 
dropdown with multiple interpretations, such as 
C ompany.Boeing, Product.Boeing 747 etc. 


ranks the order of interpretations when 
the user selects Company.Boeing 


Search missing object 


wants to augment facts for an object X that is not 
yet contained in the GoOLAP fact base 


displays a message to inform the user about the 
empty result and proposes a subscription of X 


retrieves facts for the object X with 
high priority 


INTERPRET keyword 
search via autocomplete 


wants to augment facts to the object 
C ompany.Boeing displayed in the autocomplete 
and clicks on it 


augments facts to Company.Boeing and displays the 
AUGMENT page for that object 


ranks the popularity of 
Company.Boeing and retrieves more 
facts, such as headquarter, layoffs etc. 


Show AUGMENT result 


requests an AUGMENT result for the object 
Company.Boeing 


displays the top rated facts for Company.Boeing and 
three arbitrary documents containing information 
about Boeing 


ranks the popularity of 
Company.Boeing and retrieves more 
facts 


TRACE BACK the 
document 


wants to trace back the textual origin of the fact 
EADS competes Boeing and clicks on the 
document symbol 


displays a snippet from the source document and 
explains the origin of a fact by highlighting the 
sentence where the fact was extracted from 


ranks the popularity of the fact EADS 
competes Boeing and tries to retrieve 
more evidences of this fact 


EXPAND list of objects 


wants to compare Competitors of Boeing and 
clicks on the EXPAND icon in the column header 


expands the list of Competitors of Boeing and 
displays an interactive table view for comparison 


retrieves more facts for 
Company.Boeing and the competitive 
relation to Airbus, EADS, Fokker etc. 



pages to the GoOLAP fact extractors. Most importantly, our fact predictor is 
two orders of magnitude faster than the fact extractor. The fact predictor is 
based on a support vector machine that evaluates pages on a sentence level, 
where each sentence is transformed into a token representation of shallow 
text features. 

Step 4: Fact extraction. GoOLAP extracts factual information and re- 
solves objects with home-grown extractors based on Hi and commercial 
extractors, such as OpenCalais m ■ Overall, GoOLAP provides more than 
70 different fact extractors for the domains people, companies, media and 
entertainment. As a result, the system wide ontology is limited to the set of 
fact types provided by the extractor, but also open to previously unseen fact 
instances that the fact extractor discovered in retrieved documents. An inter- 
esting open research challenge is the alignment of our fact base with DBpedia 
/ Freebase or any other large scale factual knowledge base. OpenCalais may 
provides only marginal quality on texts from domains that are different from 
business and news, such as for medical documents. Therefore our system is 
open to additional fact extractors such as extractiv.com or alchemy.com. The 
retrieval process then aggregates the extracted facts and stores them into the 
local fact base. Selecting those extractors which provide best performance on 
a certain domain, when there are alternatives available, is an open challenge. 

Data storage and parallel execution engine. Our execution and storage 
system bases on a hybrid solution is hybrid, along traditional approaches, 
such as Em, m ■ Even though it would be more interesting to have a single 
system, we chose this hybrid approach for the following reasons: 

• Factual information is small compared to the raw text documents. There- 
fore we chose to store the local fact base, the ontology, objects and small 
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document snippets of a few 100 bytes in a horizontally partitioned MySQL 
cluster database on a raid-based 256 GB solid state disc (SSD) array. The 
technology allows GoOLAP users to retrieve as well as to update and rate 
facts. 

• The current user base may generate interactions that trigger the retrieval 
of several hundred thousand pages daily. Current RDBMS may only hold 
the documents for several hundred days with a reasonable cost/storage 
balance. Therefore, we manage the sheer mass of raw textual documents 
in a Hadoop j3D| cluster, using commodity hardware, HDFS and HBASE. 
The MySQL cluster references these documents via unique keys. 

• Fact extraction is computational processing intensive. GoOLAP processes 
the massive amounts of requests for new facts in a parallel fashion on a 
cluster of 12 nodes, each running with 4 GB RAM on a 2x 2.8 GHz Intel 
CPU. The parallel query execution engine bases on Hadoop. We chose 
to abstract the functional programming interface of Hadoop through the 
declarative query language JAQL EP and extend JAQL with first-order 
functions for keyword generation with FactCrawl j3|, Web text filtering 
jS| and fact extraction 2S : j- JAQL represented objects, pages, facts and 
interactions with the semi-structured JSON format. Jaqls design has been 
influenced by Pig (32b Hive .'£J. DryadLINQ [33, among others, but 
has a unique focus on the following combination of core features: (1) a 
flexible data model, (2) reusable and modular scripts, (3) the ability to 
specify scripts at varying levels of abstraction, referred to as physical 
transparency, and (4) scalability (see also ). 

4.5 Demonstration Scenario 

Our interactive online demo shows the interplay of users exploring the 
GoOLAP fact base and thereby triggering new requests for retrieving missing 
facts from the index of a Web search engine. Figure HOI shows our demon- 
stration scenario. Our prototype can be reached at www.goolap.info. 

Consider an analyst who wants to have an overview over competitors of 
Boeing. On the GoOLAP start page, she begins to type “Boeing” into the 
search field (1). The auto complete dropdown returns multiple interpretations 
of Boeing, picking the company object by default. After sending the query, the 
following page presents an overview over augmented facts of different types 
( e.g. business location, products or employment) for the company Boeing and 
some arbitrary documents that include information about that object. She 
now opens the table Business: Competitor that shows the top 15 ranked facts 
about competing companies (2). We suppose she is unsure whether EADS is 
a competitor of Boeing, thus she clicks on the document symbol on the right 
of the row to prove evidence for that fact. The next page shows a document 
from www.defenseprocurementnews.com which explains “Boeing and EADS 
are in direct competition” (3). She can help ranking the fact by clicking on 
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Fig. 4.2. A typical user session with the goal to collect facts about competitors of 
company Boeing 

the agree button or find the sources of other facts in the table the same way. 
Next, the analyst desires to compare the competitors. She clicks on the button 
in the column header of the competing companies to expand the list. The view 
presents a tabular list of competitors of Boeing (4). She can manually remove 
an incorrect object in the list or add another one that is missing. Suppose the 
analyst wishes to compare the employee size of the companies. She performs 
an outer join on the table by adding the column Business: Employee Size from 
the dropdown menu in order to get a comparable result of facts (5). If there is 
a wrong result in one cell, she can display alternative values and find a better 
one. Again, she can trace back the evidence of a fact by clicking the document 
symbol. She is now happy with her result and saves the table under the name 
“Aviation Companies”. Now, she can continue her research at a later time 
or share the table to other users, allowing them to make their own changes. 

4.6 Research Challenges 

GoOLAP presents our attempt in providing a versatile approach for inter- 
active fact retrieval from the Web. We believe that studying the new types 
of Web usage and the interplay of fact exploration with fact retrieval - a 
well-known important problem in information retrieval - will open up a lot 
of new challenges and opportunities: 

Ranking fact sources. Only very few domains provide most of the facts 
in GoOLAP (see also Figure [Oil . Often these fact aggregators aim to col- 
lect factual information from news articles or human collaborations. Worse, 
in EH we discovered that most fact types appear only in less than 1% of 
retrieved documents. An interesting direction is a systematic integration of 
these statistical observations into the fact retrieval process; for instance, we 
could observe domains about fact update cycles or could estimate common 
replication strategies across domains. 
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en.wikipedia.org 
answers.com 
groups.yahoo.com 
newworldencyclopedia.org 
imdb.com 
article.wn.com 
blockbuster.com 
schools-wikipedia.org 
fandango.com 
usgennet.org 

0 50.000 100.000 150.000 

Fig. 4.3. GoOLAP receives facts from 40.414 different Web domains. This distri- 
bution shows the top- 10 domains that provided most facts. The X-axis depicts the 
number of extracted facts per domain. 




Quality-based fact ranking. GoOLAP ranks different facts for the same 
object with the measure of interestingness m ■ This measure captures how 
often authors of retrieved pages mention the same fact for the same object. 
The design of more meaningful, potentially quality based, ranking metrics 
desires investigation, such as metrics for ranking ‘rarely appearing’ facts in 
documents which are often requested through user interactions. 

‘Crowd’-based fact sharing and verification. Similar to a data market, 
users may request the system to collect ‘private’ fact collections and share 
these collections with few ‘trusted’ users. How can the system leverage these 
user interactions and fact collections for discarding incorrect facts? Can we 
derive a fact quality measure from observing the sharing of fact collections? 

Active learning of labeled examples from user interactions. Devel- 
oping high-quality information extraction (IE) rules, or fact extractors, is an 
iterative and primarily manual process, extremely time consuming, and error 
prone. In each iteration, the outputs of the ex- tractor are examined, and the 
erroneous ones are used to drive the refinement of the extractor in the next 
iteration. Recently, authors of 1351 introduced a provenance-based solution for 
suggesting a ranked list of refinements to an extractor aimed at increasing its 
precision. The ability to automatically generate useful refinements depends 
on the number, variety and accuracy of the labeled examples provided to 
the system. In most cases the labeled data must be provided by the rule 
developer. Unfortunately, labeling data is itself a tedious, time-consuming 
and error prone process. It would be interesting to investigate whether active 
learning techniques (SB can be combined with techniques for adapting open 
information extraction to domain specific relations pT7] and GoOLAP click 
streams to present to the developer only those most informative examples, 
therefore facilitating the labeling process. 
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4.7 Conclusion 

We introduced GoOLAP.info, a novel class of applications for answering intel- 
ligence queries over natural language Web text. GoOLAP provides powerful 
operators for retrieving, extracting and analyzing factual information from 
textual representations on the Web. At the core of this service lies FactCrawl, 
a powerful engine for the retrieval of unseen or missing facts from a Web 
search engine to a structured fact base. Over time, GoOLAP may evolve 
to a comprehensive, effective and valuable information source. Unlike sys- 
tems that utilize a cost intensive crawling strategy (such as Google Squared), 
GoOLAP elegantly utilizes user interactions, generated keyword queries and 
text filtering techniques to populate an ad-hoc and incrementally improving 
fact base. Compared to existing approaches that crawl and maintain very 
large Web archives, GoOLAP tries to minimize retrieval costs through user 
triggered fact retrieval. 

We outlined exciting challenges, such as the design of an iterative fact re- 
trieval process, the interpretation of user interaction and automatic keyword 
query generation. As mentioned already, there is a lot of experimental work 
done and to do in the future to rank or verify information; see also work 
about ranking interesting assertions in |2fi| . ranking facts m and work on 
identifying the ’true’ origin of an information 129 - 

Finally, GoOLAP gives us access to click streams that allow the academic 
community to study user interactions and to design novel powerful Web re- 
search operators. These challenges should appeal to and benefit from several 
research communities, most notably, the database, text analytics and dis- 
tributed system worlds. 
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Summary. Business Intelligence (BI) solutions allow decision makers to query, 
understand, and analyze business data in order to make better decisions. How- 
ever, as the technology and society evolve, faster and better informed decisions are 
required. Nowadays, it is not enough to use only the information from the own 
organization and making isolated decisions, but rather requiring also to include in- 
formation present in the web like opinions or information about competitors, while 
using collective intelligence, collaborating through social networks, and supporting 
the BI system with cloud computing. In response to this situation, a vision of a 
new generation of BI, BI 2.0, based on the evolution of the web and the emerging 
technologies, arises. However, researchers differ in their vision of this BI evolution. 
In this paper, we provide an overview of the aspects proposed to be included in 
BI 2.0. We describe which success factors and technologies have motivated each 
aspect. Finally, we review how tool developers are including these new features in 
the next generation of BI solutions. 



Keywords: Data warehouses, business intelligence 2.0, web 2.0, real-time, 
cloud computing, collaborative BI, social networks, collective intelligence, 
crowdsourcing. 

5.1 Introduction 

Over the last decade, the use of Business Intelligence (BI) solutions has been 
steadily increasing. Even in the recent recession period, a study from the 
Gartner Group showed that the BI market not only did not decrease, but 
instead it grew 4% pQ. BI solutions allow decision makers to query, under- 
stand, and analyze business data, in order to make better decisions and gain 
a competitive edge. Traditionally, BI applications allow managers and deci- 
sion makers to acquire useful knowledge about the current performance and 
problems of the business from the data stored in their organization, by means 
of a variety of technologies. These technologies range from data warehousing, 
data mining, and OLAP, to business performance management and periodical 
business reports. Research in these areas has produced consolidated solutions, 
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techniques, and methodologies, and there is a variety of commercial products 
available that are based on these results. 

More recently, a new trend in BI applications has emerged: BI applica- 
tions no longer limit their analysis to the data of their own organization. 
Increasingly, they also source their data from the outside, thus complement- 
ing internal company data with value-adding information from the Web (e.g. 
retail prices of products sold by competitors or opinions posted by customers), 
in order to provide richer insights into the new dynamics of business and to 
better support decision-making processes. As a result, BI applications aim 
to assist modern management practices, where decision-making requires a 
comprehensive view of the market and the business environment as a whole, 
thus BI solutions using just internal company data no longer suffices. On the 
other hand, at the same time as data from the Web is being included into 
BI applications, BI applications are also evolving towards the web. Recently, 
internal company information systems are being transformed into BI as a 
service (e.g. hosted BI platforms for small and medium-sized companies) and 
software support to manage business outsourcing or crowdsourcing is the tar- 
get of huge investments and the focus of enormous research efforts by both 
industry and academia. 

Nevertheless, this trend is not limited to just relocate the BI processes onto 
the web, but rather to transform how BI is performed. This transformation is 
being influenced by the apparition of different, new technologies in the web 
2.0, as well as the recent rising of the social networks |2j. Together, these 
factors provide a general vision of which are the features the next generation 
of BI tools should include. However, when analyzing the influence of the 
different factors on BI processes, there is no clear consensus. While some 
authors pin E| focus on the technical aspects and propose to adapt the 
most recent techniques to the current BI tools, others describe a complete 
transformation of the BI processes Pd In our reseach group, and according 
to this new evolution of classical BI, we have been organizing two editions of 
the BEWEB Workshop |H1 trying to allocate this evolution. Furthermore, 
in [2, we also collected different trends related to Business Intelligence and 
the web. 

In this paper, we provide an overview of the different aspects of the so-called 
“BI 2.0” , how the web has influenced BI and how BI is reacting to this influ- 
ence, including the analysis of technical challenges that must be overcomed, as 
well as how the current BI tools are integrating the new features. We analyze 
which are the common aspects envisioned, in order to identify the scope of the 
2.0 generation. In order to visualize how these new aspects could be integrated 
into the traditional architecture of a BI system based on data warehousing, we 
show a potential architecture in figure FTTl In this figure, we can see how tradi- 
tional Extraction, Transformation and Loading (ETL) batch processes are now 
being substituted by real-time integration processes. These processes should 
also integrate data coming from the web or, alternatively, this data should be 
integrated into the analysis at later stages, without being stored in the Data 
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Warehouse (DW). The DW is now distributed in the cloud, which provides 
the necessary scalability. Furthermore, the information from the DW is re- 
trieved through BI services, which send the requested information through the 
network to the users. Finally, these users do not take decisions in an isolated 
manner, but instead interact and interchange information achieving better de- 
cisions in shorter times. This interchange can be done either internally, through 
the business BI system, or between organizations, for example using Business 
Intelligence Networks DU. 




Fig. 5.1. An overview of the new BI 2.0 architecture 



The rest of the paper is structured as follows, Section 5.2 gives a brief 
overview of the recent evolution on BI and DWs. Section 5.3 introduces the 
basic concepts related to BI 2.0. Section 5.4 describes how the new technolo- 
gies and the web have influenced traditional BI processes and tools. Section 
5.5 describes the current technical challenges and the different solutions pro- 
posed to overcome them. Section 5.6 provides an overview of the current BI 
tools and which new features are being implemented. Finally, Section 5.7 
describes the conclusions and sketches the expected evolution in this area. 

5.2 Business Intelligence and Data Warehouses in a 
Nutshell 

Traditionally, BI has been involved in analyzing structured data, aiming to 
improve the business performance and providing a competitive edge. How- 
ever, due to the size of data to be queried, and the nature of the decisional 
environment, a specific design for the database which stores the necessary in- 
formation for the BI process is required u . In order to solve this problem, it 
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is widely accepted that the design of the database schema should be oriented 
to improve the performance, creating a separate database with a schema that 
minimizes the number of joins required to answer a given query. The result- 
ing schema at logical level is named Star Schema, and its variants Snowflake 
Schema and Fact Constellation, depending on how tables are organized. This 
methodology dim has become popular for building the DW, and vendors 
have implemented their own DW designer tools using this approach. 

However, a main problem arises from designing the DW from a logical point 
of view. Since the logical model is designed for storing data and improving 
query performance, it does not consider the analysts’ needs d Therefore, it 
is hard for an analyst to (i) understand the underlying schema, and (ii) obtain 
useful information from the stored data. The main drawback is that the 
logical level lacks multidimensional details, required to specify information 
related to hierarchies and other complex multidimensional structures, which 
organize the data for the analysis task and hide implementation details. 

In order to solve this problem, researchers have proposed a number of 
different conceptual modeling approaches HU ESI ESI HZ|, although unfortu- 
nately, up to now none of them has been accepted as a standard. Nevertheless, 
vendors have also perceived this defficiency, and include their own representa- 
tions of multidimensional structures into their tools, in order to allow analysts 
to query the DW without using SQL queries. However, being able to query the 
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DW does neither guarantee that it stores the necessary data nor guarantees 
that it is structured as required by the analysts. Therefore, some researchers 
have gone a step further, and include a requirements analysis stage in their 
DW design process H3 HB|, providing an alignment between the DW and 
the analysts’ needs, thus reducing the failure rate of DW projects. Most of 
these requirement analysis stages are based on intentional analysis, where the 
analysts’ needs are modelled as goals. An example of the intentional analysis 
proposed in ESI, using an i* PC3| profile, is shown in figure 15.21 This model 
represents the user intentions starting from a Business Process. In order to 
improve this process, there are a series of goals to fulfill, which are eventu- 
ally achieved by means of gathering the necessary information ( Contexts and 
Measures ) , which will be stored in the DW. 

Moreover, in addition to the requirementsnanalysis stage, a hybrid ap- 
proach for designing DWs by using the Model Driven Architecture, was pro- 
posed in 1221 I2U This hybrid approach considers both the user needs, at 
requirements level, as well as the existing data in data sources, thus detect- 
ing any problem or conflict in early stages. Since the proposal was aligned 
with MDA, it allows the DW designers to build the DW in a semi-automatic 
way, dramatically cutting development time and costs, and significantly im- 
proving the DW development process. 

Nevertheless, many challenges are still open for further research in DWs in 
order to better support the BI processes. On the one hand, BI processes are 
constantly evolving. Therefore, it is crucial for the DW to include support 
for new requirements as BI processes change. Some work has been recently 
done to address this issue m- On the other hand, there is still ongoing work 
on how to include the unstructured information present in the Web into the 
DW |23 EH 21 ESI ■ Information present on the Web can be highly relevant 
for the business. However, this information is not always accurate and its 
correctness is not guaranteed. Therefore, integrating it directly into the DW 
with the rest of information stored by the organization may lower the quality 
of the data. 

Furthermore, the usability of BI tools is still an issue. Some works m show 
that analysts and decision makers do not exploit all the capabilities provided 
by the BI platforms. This is mainly due to two reasons: (i) these platforms 
provide a huge set of features, transforming the problem of analyzing huge 
amounts of data into a problem of finding the right approach (and the right 
tool) in order to perform the analysis, and (ii) most advanced features pro- 
vided by BI tools actually require technical knowledge of both the tool and 
the underlying data structures, e.g. building interactive dashboards, which 
does not allow analysts to use BI platforms to their full extent. In order to 
solve this problem, the current direction is providing more user-centric tools, 
which provide a personalization of the platform while helping the users in 
their analysis tasks, allowing them to query the data through QA systems 
and more intuitive interfaces. 
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After having briefly presented the history about BI & DWs, we will proceed 
to describe and analyze the upcoming evolution of BI 2.0. 

5.3 Basic Concepts Related to BI 2.0 

In this section we will introduce the basic concepts related to BI 2.0. Each 
concept included in this section is either a technology or an aspect related to 
the new vision of BI. 

• Real-time: Real-time usually refers to the fact that the maximum time 
required for a task to be completed is known and defined. However, in 
BI, real-time often refers instead to the concept of up-to-date data |5J, 
and data streaming. Therefore, when talking about real-time data in BI, 
we must deal with data flows whose structure is unknown until they are 
being interpreted. The lack of completeness of the information at a certain 
point forces to manage exceptional situations which would otherwise not 
occur. For example, we may have only partial information about a certain 
transaction, such as the amount, date and client related to a transaction, 
but not the products. 

• SaaS (Software as a Service) 0|: SaaS refers to software being developed 
for its use through Service Oriented Architectures (SOA). The Service 
Oriented Architecture Protocol (SOAP) allows to invoke pieces of software 
through the HTTP protocol, by passing the parameters specified by the 
service interface. In this way, SOAP provides a way to invoke software 
services while abstracting from the implementation technology. Recently, 
this vision is also being applied to BI solutions. This allows us to obtain 
results from algorithms, reports and other interesting information from 
a remote server and combine this services into new value-added ones. 
The most interesting feature about SaaS is that it allows us to deploy 
solutions based on cloud computing, which are highly scalable, as opposed 
to traditional solutions. Additionally, this approach allows companies to 
“rent” their software and BI services through internet. 

• Cloud computing |2T.: Cloud computing was originated as the integra- 
tion of several, heterogeneus elements into a “cloud” or network. A mid- 
dleware layer provides a homogeneous interface for the user or software 
accessing it, while hiding the details of the underlying technology. The 
cloud also supports the addition of new elements, allowing us to increase 
the network capabilities to meet the demand as necessary depending on 
the work load. 

• Collective intelligence [23 : Collective intelligence was first used to refer 
to emerging behaviours in colonies. Often, these colonies would be formed 
by insects, and presented more complex behaviours than those of the 
individuals forming them. An example collective intelligence can be seen 
especially in social networks, where decentralized groups of people with 
no leader are able to take decisions and promote initiatives which would 
not be possible by a single individual. 
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• CrowdsourcingPll: Crowdsourcing refers to delegating a certain task 
to the crowd. The effectiveness of this approach comes from the fact that, 
typically, each individual must only perform little effort in order for the 
group to achieve its goal. Moreover, the collective intelligence present in 
the crowd can obtain solutions which could be very difficult to discover 
by an individual, even if he was dedicated exclusively to that task JJjj . 

• Social networks: Social networks consist of a collection of data provided 
by its own participants, as well as the relationships between them. These 
networks allow their participants to interact and contribute with further 
information, thus enriching the existing data. The collaboration between 
participants provides faster and better results than what a single individ- 
ual can achieve, thus, researchers are developing proposals which apply 
this concept to BI. P Hit. 

• Linked data|2nj: Linked data refers to the idea of relating each piece of 
information to the rest of information which affects or is affected by that 
piece. Ideally, linked data means knowing and being able to exploit the 
existing relationships between every piece of information recorded, which, 
in turn, means that the relationships are semantically tagged and can be 
used by a computer to reason. This aspect is specially relevant in order 
to automatically obtain knowledge from existing information in the web 
and the information stored in DWs m 

• Opinion mining |i£2l : Opinion mining refers to the process of describing 
the general feelings or opinions of a group of people towards a certain 
element. In this way, opinion mining implies being able to understand a 
given set of opinions and obtaining a conclusion from them. This infor- 
mation is typically found in the web as unstructured data. Nevertheless, 
this information can be highly relevant for an organization, enabling to 
identify which products are perceived better by the customers and why. 

• Process oriented BI[23j: Process oriented BI is a point of view that fo- 
cuses on the processes and their logic, relating the stored data to business 
process performance, instead of focusing on simply presenting aggregated 
data in different formats. This point of view allows us to identify and re- 
structure processes which are not contributing towards the business goals. 

After having introduced the basic concepts, we will proceed to describe 
how the web, and the new technologies, have influenced the vision of BI and 
the features included in BI tools. 

5.4 Influence from the Web on Business Intelligence 

As society evolves, and the ratio of connectivity to the Internet increases, 
organizations find themselves environment that is constantly and rapidly 
changing. In the recent years, more and more businesses have started to 
provide their services in an online fashion. Therefore, now customers have 
easy access to a wide variety of offers, and become much more critic with the 
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products they buy. With the appearance of social networks, the blogosphere, 
and the web 2.0 Mi customers interchange opinions about the products 
they buy and other similar offers. These opinions influence other customers 
either motivating them to also purchase that product, or stopping them 
from buying it. In this context, businesses need to consider as much informa- 
tion as possible when taking strategic decisions in order to gain the 

edge and retain or even increase their share of the market. Moreover, these 
decisions must be agile, in order to react in time to existing problems and 
threats, or covering possible weaknesses. The introduction of these new needs 
has altered different aspects related to BI tools and processes, such as: user 
interfaces, up-to-date periods of information, number of persons required to 
take a decision, how data is presented and the focus of analysis. 

The first aspect that has experienced an evolution are the user interfaces 
of the BI solutions. Nowadays, the information required to take decisions must 
be available to be checked from anywhere. Additionally, the way of interacting 
with the tools must be intuitive and require as little training as possible. 
Therefore, desktop applications have become obsolete and web interfaces are 
almost mandatory. In fact, most BI solutions |23 EBl EDI EDI already include 
a website as their interface. Recently, these web interfaces have been adapted 
to be visualized in mobile devices, allowing the appearance of mobile BI 
solutions, often referred as Mobile BI. Mobile devices are characterized by 
their reduced size and memory, with interaction mainly based on clicking a 
touch screen. Therefore, BI 2.0 solutions must provide services which generate 
user interfaces adapted to show information in a very limited space, while also 
allowing us to interact and navigate through the data with simple clicks, and 
transferring only the strictly necessary information. 

The second aspect which is changing is the information up-to-date pe- 
riods. Initially, information was typically presented in the form of reports, 
designed to identify trends and allowing to elaborate strategies in order to 
avoid future problems. However, nowadays reports are checked when there is 
already a problem. This is mainly due to two reasons. The first reason is that 
they are not easy to be read for decision-makers, unless they are being used to 
identify an existing problem. The reason is that they typically present a huge 
quantity of non-interactive data, thus they do not allow to perform ad-hoc 
analysis and its difficult to relate them to business goals and strategies. The 
second reason is that, very often, they arrive too late due to the rapid change 
of the environment. In the current environment, monthly and weekly reports 
are already dated when they are generated. For example, a defective product 
or a security problem can harm the corporate image in under a week, or even 
in less than one day. Some examples are the well-known Dell hell history or, 
more recently, the security breaches at Sony. Previously, unattended com- 
plaints, and most problems, would just dissapear and other customers would 
not notice. Now, customers interact with each other and post their opinions, 
forcing the business to react at the same speed as this information is spread. 
Therefore, businesses now need to gather and analyze their incoming data in 
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real-time, and due to its size, in most cases this analysis is required to be 
done automatically. 

The third aspect that is evolving is how decisions are taken, inspired by 
the recently seen social networks. Traditionally, decision-makers would take 
decisions according to reports provided by the BI system. These decisions 
would be taken in an isolated manner or in small groups of executives and 
analysts. Some authors |3 HH propose that, a better way to take decisions, 
is to use the collective intelligence of the enterprise or directly delegate them 
to the crowd ( crowdsourcing ). In this way, when taking a decision, an agile 
forum of discussion could be created, where people with knowledge over the 
different aspects involved in the decision could argue which is the best course 
of action. Alternatively, another way to provide additional human insight, is 
to directly use the contributions from employees as an additional source of 
information, instead of involving them into the discussion. 

The fourth aspect suffering changes is the interactivity between the user 
and the BI system. This aspect, related to the previous one, revolves around 
the ability to interchange information between BI users, as well as contribut- 
ing with information enriched or created by the own user. This aspect has 
been emphasized by the Web 2.0 and the new capabilities provided by this in- 
teraction, specially on blogs and web applications. In this way, data provided 
by the BI system should no longer be shown as read-only reports, graphs 
and figures, but rather as a read and write element in which users can in- 
clude annotations, link with other elements, and easily send the information 
to other BI users. This last feature is very relevant, since a recent study from 
Penteo m showed that, nowadays, Excel spreadsheets are the third most 
used BI tool. The main reason behind this trend is that executives find easy 
to interact with it, and because is easy to send a spreadsheet to a partner 
which he can open right away. 

The fifth aspect is how data is presented. Currently, most BI systems 
are focused on just showing data in different ways like bar graphs, spread- 
sheets, aggregations, etc. However, these presentations just focus on how to 
show the information provided by the data, and do not take into account the 
cognitive model of the user, or the pursued strategies in order to improve 
the business performance. Therefore, it is harder for BI users to actually find 
relevant patterns or develop new strategies using this visualizations. In order 
to solve this issue, different alternatives have been proposed. Some of these 
alternatives are traditionally introduced in BI tools, such as linking the data 
with balanced scorecards, while other proposals also point to link goal mod- 
els, or even with business process models, with the stored data, thus easily 
identifying which parts of the business strategy require improvement. 

Finally, the last aspect, is related to the focus of analysis. Traditionally, 
there has been an extensive use of descriptive data mining techniques m 
These techniques are used for example, to segment groups of clients and iden- 
tifying different client profiles, analyze past trends or problems, and describe 
the evolution of a certain business activity, like sales. Now, this focus is shifting 
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towards predictive analysis, centered on supporting decisions like “will there 
be enough products to meet the demand? should I produce more?” . However, 
since the environment is changing at a very fast rate, these questions now refer 
to the inmediate future, introducing a strong time restriction on the answer. 

After having presented the different aspects of BI which are experiencing 
an evolution, we will proceed to present the technical challenges which must 
be overcomed in order to support the previously described features. 



5.5 Technical Challenges of Business Intelligence 2.0 

The new aspects to be introduced in BI 2.0 cannot be accomplished unless 

the necessary technology is developed. Therefore, in order to reach the new 

envisioned BI, a series of technical challenges must be overcomed. 

• Real-time Data warehouses pj|: In order to achieve a real-time flow 
of data and be able to process it, the way of capturing this data must 
be modified. Traditionally, data is captured into transactional databases 
when a transaction related to a business process ocurrs. Then, at some 
point determined by the refreshing cycle of the DW, the new data is 
cleaned, processed, and loaded through ETL processes ( bulk feed). After- 
wards, data can be used for analysis and supporting decisions. However, 
this process is time consuming and has a negative impact on the perfor- 
mance of both transactional systems (from where data is obtained) and 
decisional systems (since ETL processes perform operations on tables in 
the DW). Therefore, this approach does not meet the real-time require- 
ments of BI 2.0, rendering unable to use the decisional support systems 
while they are being loaded with new data. In order to overcome this 
technical challenge, two main modifications must be performed on ETL 
processes. On the one hand, data must be captured as it is being recorded 
in transactional systems ( trickle feed). This can be achieved through trig- 
gers in transactional databases, logs, and other existing techniques. On 
the other hand, the whole ETL process should be modelled and auto- 
mated, supporting the incoming data as a workflow (see figure IT Ml) and 
considering potential exceptions. Therefore, data being loaded might not 
be complete, and could have information about certain dimensions of the 
DW missing, or even more, it could present an unexpected pattern, mak- 
ing it necessary to correctly define exception management policies. Once 
the delay introduced by ETL processes has been minimized, query delay 
can be minimized using parallelization, hardware acceleration and main- 
memory database approaches, in order to obtain a seamless real-time BI 
process. Finally, the last aspect to consider in real-time DWs is the analy- 
sis process. Since data is constantly flowing into the DW, the information 
stored is being modified constantly. Therefore, users are not able to per- 
form multiple analysis over the same set of data, since when they perform 
a second analysis, the data has already changed, producing inconsistent 
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results. In order to overcome this limitation, the current approach is to 
save snapshots of the underlying data, allowing the decision makers to 
analyze the snapshots multiple times. 




Fig. 5.3. Real time ETL processes with a continuous parallel workflow 



• Scalability: Scalability issues can affect many different parts of the 
decision-making process. From the transactional databases, from where the 
information is being extracted, to the DW, used as data source for the BI 
solution, or even the network providing the results to the different users. 
Since in the envisaged BI 2.0 the number of BI users is expected to increase, 
scalability issues have an increasing importance. There are three main ap- 
proaches for solving these issues. The first one is by providing more power- 
ful single elements of hardware, while the second is simply providing more 
hardware elements. The third one is a combination of the previous two, 
while trying to achieve the most cost-effective solution. In the first group, 
we can see more and more powerful DW servers being developed by differ- 
ent companies. Some examples are the Exadata from Oracle m and the 
DW server architecture from Netezza EH- These DW servers include very 
powerful characteristics with a high number of cores and huge RAM size, in 
order to increase the speed of the queries formulated against the database. 
In the second group, we have the recently seen cloud computing services, 
which were introduced in Section 5.3. Examples of cloud computing ser- 
vices are the Amazon services (s3, ec2), Azure Cloud from Microsoft, and 
the recent iCloud from Apple. The first solution has the advantage that the 
information is stored and hosted by the organization, which potentially can 
provide more privacy and security. On the other hand, cloud services are 
more flexible to meet the fluctuations in demand throughout the day with- 
out incurring in unnecessary costs, and hide some of the technical problems 
from the users, as described in m- 

• Semi-structured and unstructured data: Nowadays clients inter- 
change opinions through social networks and comments posted in forums, 
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blogs, etc. while competitors post special offers on their website. This 
means that the business no longer has all the relevant and necessary in- 
formation stored m- Therefore, all this information must be gathered 
from the outside. This information is highly relevant for the business. 
Knowing the opinions and ratings given by clients about certain prod- 
ucts E3, or the marketing strategies used by competitors, can provide an 
edge and allow to quickly react by offering special counter promotions. 
However, this information is either presented in a semi-structured format 
(as XML files) or, in most cases, in unstructured natural language or as 
content inside a webpage. Therefore, and due to the size of information 
which must be analyzed, it must be automatically parsed by using special 
algorithms and natural language processing (NLP) tools, in order to iden- 
tify the relevant information and be able to link it with the information 
stored by the organization. However, due to complexity and ambiguity 
of the natural language, NLP tools currently are not accurate enough to 
guarantee that the information obtained is correct. Therefore it is recom- 
mended to not directly integrate this information with the one gathered 
from operational systems and business processes, as it would lower the 
quality of the data and potentially harm decisions taken. This area is still 
open for further research. 

• Predictive data mining: As the focus of analysis is shifted towards the 
inmediate future, the importance of predictive analysis increases. Since 
the amount of information to be analyzed in order to take a decision is 
too big for a human expert, this analysis must be done automatically. 
Therefore, it is necessary to develop algorithms which analyze the data 
and describe the current trend and the expected evolution, in order to 
support the new information needs. Furthermore, it is also important 
to lower the technical knowledge required to apply these algorithms, as 
decision makers have been making little use of data mining techniques in 
the past B9- However, although there are already a number of predictive 
data mining techniques m , the most important restriction introduced 
in BI 2.0 is the time constraint. If the result can not be obtained in 
the required space of time, then the result is rendered useless. Using the 
example from the previous section, if our algorithm predicts that we will 
need additional supply to meet the demand for the rest of the day but the 
result is obtained the next morning, then nothing can be done to avoid 
the situation. In this way, predictive techniques used must obtain results 
in delimited periods of time, even if some precision must be lost. This 
area is also open for further research. 

• Analyzing business processes: Recently, business processes have gained 
increased attention from the community, and more and more research 
is being conducted in this area m , specially focused on business pro- 
cess models. Business process models describe the flow of data through a 
specific business activity, detailing how the state of data objects evolves 
throughout the process. Research in this area has dealt with identifying 
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deadlocks inside processes, providing levels of abstraction to make them 
understandable, or even obtaining business process models from natu- 
ral language specifications [25 j- However, business process models lack 
information about the structure of the underlying data, as well as the re- 
lationship between the business process and the business strategy. There- 
fore, it is necessary to integrate this models with the stored data and 
the business goals, in order to be able to analyze how anomalies, errors 
and improvements in business processes affect the business performance. 
These aspects are related to Business Process Intelligence, and is an area 
open for further research. 

• Linking data ..f(j : The most important feature about a specific piece of 
data in order to analyze it is the different relationships it has with other 
data. Some of these relationships are already explicitly included in data 
stored in the enterprise DW (i.e. facts, dimensions and hierarchies). Other 
relationships are implicit, and can be discovered through the use of data 
mining techniques, obtaining for example, a set of rules implicitly present 
in the data. However, some relationships relate two independent sets of 
data, and might only be identified by a human analyst while inspecting 
information from different sources (e.g. an analyst comparing the sales 
of a product and visualizing similar products from competitors). This is 
specially interesting when trying to analyze data obtained from the Web 
together with the existing information inside the organization. In order 
to be able to automatically reason and infer new knowledge from these 
relationships, they must be modelled and semantically tagged, allowing to 
differentiate the kind of relationship established. In order to model these 
relationships, an approach to link the data must be developed. Some ex- 
amples of proposals for linking data include traceability approaches m 
or using ontologies and semantic tagging m- While generic approaches 
can be applied to link the data, the semantics of the relationships are typ- 
ically domain dependent, since an exhaustive definition of every possible 
relationship between two pieces of data would be non- viable. Since this 
area of research is growing, and recently new sets of relationships have 
appeared (like those in social networks), it is open for further research. 

After having presented the different technical challenges of BI 2.0, we will 
present some tools from different vendors and will analyze which features are 
already integrated and which ones still require further effort to be included. 

5.6 General Overview of Tools Stepping towards BI 2.0 

As BI has gained more and more attention, different vendors have developed 
tools for querying the underlying information stored at enterprise DWs. Some 
of these vendors are big companies like Oracle, Microsoft, SAP, IBM, Mi- 
crostrategy or SAS, while others are open source groups like the one behind 
Pentaho BI. In this section, we take a brief look at different tools provided 
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by some of these vendors and how they are integrating the characteristics we 
have described in Section 5.3. 

The first tool we analyze is the BI tool from MicroStrategy |2J. Micros- 
trategy includes a web interfaces for their BI tool, which is most important 
in Mobile BI. The Mobile BI application is designed for iPad, Blackberry 
and iPhone devices, and provides a way to analyze the stored data from 
anywhere. On the other hand, Microstrategy also provides scorecards and 
dashboards to connect the underlying data and Key Performance Indicators 
with the business strategy and strategic goals, reducing the cognitive effort 
of the users. Finally, Microstrategy is providing some support for predictive 
analysis through mathematical functions, in order to meet the new analysis 
requirements. On the negative side, the BI tool lacks predictive data mining 
support with algorithms, some interactivity and collaboration between BI 
users, and integration of business processes. 

The second tool we analyze is Pentaho m ■ Pentaho is an open source BI 
tool, which can be integrated with different platforms and includes a special 
enterprise version. Pentaho provides a web interface, as MicroStrategy, as 
well as a dashboard for linking data, including support for scorecards but 
requiring some effort to integrate them. Pentaho also includes support for 
predictive analysis, including some data mining algorithms. In the collabo- 
rative aspect, some work has been done to integrate LifeRay and Pentaho 
in order to provide a social feeling. On the negative side, Pentaho also lacks 
support for integrating business processes as well as interactivity. 

The third tool we analyze is Cognos from IBMffJ' ■ Cognos, as the previous 
two tools, includes a web interface, dashboards and scorecards for linking 
data. Moreover, Cognos also includes support for mobile devices, as well 
as collaborative BI support, allowing to include annotations, opinions, and 
enriching and sharing data, in order to achieve a collaborative decision process 
and using the collective intelligence. On the negative side, the predictive 
capabilities of Cognos are limited and require to use additional software for 
fully dealing with predictive analysis. 

The fourth tool we analyze is SAS BI m- SAS provides a web inter- 
face with highly customizable dashboards, which can be linked with other 
elements and dashboards. Along with interactive widgets and visualization, 
SAS BI includes special visualization tools which help to understand the 
underlying data. SAS also provides support for mobile devices, allowing to 
analyze the performance indicators and other information from anywhere. On 
the negative side, there is no information about the predictive data mining 
support, or the inclusion of collaborative BI. 

The fifth tool we analyze is Microsoft BI j3B|. Microsoft BI is composed 
of a series of different technologies, combining Excel, PowerPivot, Sharepoint 
server, and SQL Server, in order to provide a BI system for the business. 
As other vendors, Microsoft includes scorecards and dasboard mashups in a 
web interface through Sharepoint, in order to link data and provide insightful 
information about the business performance. Moreover, Sharepoint allows to 
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add tags to our own profile and check the profile of other people, enabling 
some degree of collaboration. The analysis of detailed data is done through 
Excel and PowerPivot, which allows to also include data from webpages. 
On the negative side, since this BI system is composed of several technolo- 
gies, some expertise is required to use them together without a loss of time 
derivated from integrating the results. On the other hand, the predictive data 
mining techniques provided are limited to Excel functions (aside from using 
algorithms directly on the database which stores the information), and the 
collaborative part lacks some features like enriching the data so other users 
can see these anotations, as well as an easy way to interact while analyzing 
the data. 

The last tool we analyze is SAP BI Em- SAP BI is composed, as in the 
case of Microsoft, of many different tools. SAP includes complete support 
for analyzing the business strategy with a combination of desktop and web 
applications. Among the features provided, SAP includes dashboards and 
scorecards, a web interface for ad-hoc analysis, support for mobile devices, 
and a complete workbench for data mining tasks. Furthermore, SAP also in- 
cludes integration capabilities with Excel and enterprise applications, which 
empowers the analysis capabilities. On the negative side, although the anal- 
ysis capabilities are very powerful, the BI system currently lacks support 
for active collaboration and enriching information (although this aspect is 
being improved 5J), which limits the applicability of collaborative BI and 
crowdsourcing. 



Table 5.1. Summary of features provided by BI tools from different vendors 



Tool 


MicroSt 


Pentaho 


Cognos 


SAS 


Microsoft 


SAP 


Web interface 


Y 


Y 


Y 


Y 


Mixed 


Mixed 


Scorecards 


Y 


Requires effort 


Y 


Y 


Y 


Y 


Dashboards 


Y 


Y 


Y 


Y 


Y 


Y 


Interactivity 


N 


N 


Y 


Y 


Limited 


- 


Collaboration 
between users 


N 


LifeRay 


Y 


“ 


Some 


Under 

develop. 


Enriching 

information 


N 


N 


Y 


N 


Some 


Under 

develop. 


Predictive 

Analysis 


Math 

functions 


Weka 


Additional 

software 


“ 


Excel 


Complete 

support 


Business 

Processes 


N 


N 


N 


N 


N 


Y 


Integration 


Single 

platform 


Independent 

modules 


Single 

platform 


Single 

platform 


Multiple 

software 


Multiple 

software 


Additional 

features 


Dedicated 
Mobile BI 


Open 
source BI 


_ 


Visualization 

tools 


Supports 
web data 


Enterprise 

integration 
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Most of the tools support the use of cloud computing through the SaaS 
approach, in order to deal with the scalability issues. In this way, BI services 
are provided as a service from the cloud, where the BI server is deployed. 
However, some tools also include specific support for cloud computing, like 
Cognos (guide to deploy the server in the cloud), Microsoft BI (Azure cloud), 
and Pentalro (Hadoop, for data integration). On the other hand, in most 
tools, data is linked by means of dashboards and additional code, rather 
than interaction. While this enables a way to present the data in a linked 
way, it requires certain effort to achieve and is not really intuitive. Moreover, 
the presentation of the data can be connected to business strategic goals 
in some tools but it is not used as the main way of visualizing the existing 
data, which is still data-driven. Finally, most tools lack interactivity to enrich 
the data and contribute to a collaborative BI, by providing means to easily 
interchange information and opinions with other BI users. 

5.7 Conclusions 

In this paper, we have presented an overview of the different aspects related 
to BI 2.0. We have described each aspect and its motivation, as well as the 
technical challenges which have to be overcomecl. The main consensus about 
BI 2.0 is that it should be processed in real-time, in a more intuitive and 
collaborative manner, as opposed to the previous generation. Some of the 
technical challenges already have the necessary technology to be overcomed, 
but still require effort to achieve its integration into BI tools. Most of these 
tools already include a few envisioned characteristics at the moment. How- 
ever, we have not yet reached the envisioned BI 2.0, and some time will be 
required for most tools to completely adapt and include the required features. 

On the other hand, further research is necessary to identify the most effec- 
tive way of presenting data, as well as developing easy to use, more accurate, 
and faster predictive algorithms, while also identifying which are the best 
practices for interacting and contributing with information inside the busi- 
ness environment. 
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Summary. The incredible rising of on-line social networks gives a new and very 
strong interest to the set of techniques developed since several decades to mining 
graphs and social networks. In particular, community detection methods can bring 
very valuable informations about the structure of an existing social network in 
the Business Intelligence framework. In this chapter we give a large view, firstly 
of what could be a community in a social network, and then we list the most 
popular techniques to detect such communities. Some of these techniques were 
particularly developed in the SNA context, while other are adaptations of classical 
clustering techniques. We have sorted them in following an increasing complexity 
order, because with very big graphs the complexity can be decisive for the choice 
of an algorithm. 

Keywords: Graph Mining, Community Detection, Data Mining. 



6.1 Introduction 

In the actual interconnected world, and the rising of online social networks the 
graph mining and the community detection become completely up-to-date. 
If the interest for social networks analysis exists, even in germ, from the 
beginning of sociology, in the works of Emile Durklreim or Auguste Comte, a 
more systematic study started in the 1930s with the work of Moreno, Davis 
and Mayo (see [I and |2J for a more complete state of the art of the social 
network analysis) . And this systematic study includes the use of graph theory. 
The theory of graphs exists since Euler’s solution of the Knigsberg’s bridges 
problem in 1736, but networks, treated in the pioneer times of Euler (for 
graph theory) and Moreno (for social network analysis), contained only some 
dozens nodes at most, while the rising of computer science and one of its 
fields, the data analysis (see (Sj or 0 for a good introduction this latter), 
allow to analyze graphs of big sizes, and permit to develop one of the task of 
social network analysis: finding groups with high concentrations of relations 
within the group and low concentration between these groups, which is called 
community detection H3EHZ|. 
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This overview of the community detection algorithms is organized as fol- 
lows: in Section 2 we introduce the basics of social network and network 
theory. In Section 3 we give a review of the definition of a cluster or commu- 
nity in social network analysis. Then, in Section 4 we take a look to the set 
of measures which can be used in clustering of social networks. Sections 5, 6, 
7 and 8 give a review of the most important clustering methods. The order 
chosen for this review is directly linked to the complexity : we begin with the 
“cheapest”, and then more scalable (partitional and hierarchical methods), 
then we continue with a more greedy but very efficient technique : the spec- 
tral clustering, and we end with the Galois lattices which are costly methods, 
but with very rich results. Then we close this part with a discussion. 



6.2 Social Networks 

Even if Moreno |BJ was the first to use points and lines to represent social 
configurations, it was Cartwright and Harary jj| which made the link with the 
graph theory, and thus introduced the actual graph representation of social 
network (see [H3| for a review of the evolution of social networks representa- 
tion): individuals are represented using points, called nodes or vertices , and 
social relationships are represented using lines, called edges or links , between 
nodes. In figure 101 we show an example of social networks using this graph 



Zachary’s Karate Club Network 




Fig. 6.1. A classical example of social network : the Zachary’s Karate Club 



representation: The first one (fig. I(>. IK is a the well-known graph EH: the 
Zachary’s Karate Club, it consists in 34 vertices, the members of a karate 
club in the United States, who were observed during a period of three years. 
Edges connect individuals who were observed to interact outside the activities 
of the club. 
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Let use introduce here the basic concepts of graph theory. A graph G is 
defined as a pair of sets: G = (V, E), where V is the set of vertices or nodes, 
and E the set of edges or links which connect vertices. The two vertices 
connected by an edge are called endpoints of this latter. Conversely the edges 
connecting a vertice to the other vertices are called incident edges of the 
node. The number of vertices \ V\ = n is called the order the graph, while the 
number of edges \E\ = m is called the size of the graph. If \E\ = n(n — l)/2, 
i.e. if any pair of vertices are connected, then the graph is called complete . 

A graph can be directed or undirected. In the first case, also called digraph, 
an edge is denoted as pair (v,w), where v is the origin and w the target, 
and in the social networks framework it means: “v is in relation with w” , the 
opposite being true if and only if there exists an edge (w,v). If the graph G 
is undirected, then an edge is denoted as the set of its vertices : {u,ru}. For 
the sake of simplification, when we will give a definition valid in both cases, 
we will use this last notation. 

Most of the times, vertices are labeled, but it can also be the case for edges, 
then G is called a labeled graph. Moreover, if exists a function u> : E — > R. 
that assigns a weight for each edge, then G is a weighted graph. 

Two vertices v and u are called neighbors or adjacent, if they are connected 
by an edge. The set of neighbors of a node v, denoted r(v), is called its 
neighborhood. 

The topology of the graph can be captured in the adjacency matrix A = 



( Qij ): where 



a i,j — 



1 if (vi,Vj) G E, 
0 otherwise. 



( 6 . 1 ) 



If G is a weighted graph, then a,;j = u>(vi,Vj) and in this case we prefer 
to use the notation W = (wij) = (u>(vi,Vj)). Of course in an unweighted 
graph w(vi,Vj) G {0,1}. For an unweighted graph, the degree of a vertice 
v, denoted deg{v ), is defined as the number of incident edges, but a more 
general definition is given using the weights matrix W: 



deg(vj) = ^ vj 1:1 . ( 6 . 2 ) 

3 = 1 



A subgraph G' = (V',E’) of a graph G = (V,E) is such V' CV, E' C E 
and {v,u} G E' implies v,u G V' . The graph G is a supergraph of G' . A 
subset C of V can define an induced subgraph G(C) = ( C,E(C )), where 



E(C) = {(v,u) G E\v,u G C } 



(6.3) 



A complete subgraph is called a clique. 

The density of a subgraph C(V(C),E(C)) is the ratio between \E(C)\ and 
the maximum possible number of edges: 



S(G(C)) 



\E(C)\ 

\V{C)\{\V{C)\-l)/2 



(6.4) 
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this definition still being valid for the whole graph G. 

A partition of the vertices set V in two subsets C and V \ C is called a 
cut. The cut size, denoted c(C, V \C) is the number of edges of G joining 
vertices of C with vertices of V \ C: 



Both set C and V \ C define the cut, but usually the cut is identified by the 
smaller one. 

A path between two vertices v and u in a graph G is a sequence of edges 
starting with vq = v an such the last one is Vk = u : 



A path P is also a subgraph of G: P = (V(P), E(P)) with V(P) = 
{^o, - - ■ ,Vki} and E(P) = {{i; 0 , «i}> {«i, v 2 }, ■ ,{vk-i,Vk}}- The length of 

a path is the number of edge in this path, k in expression ( 16 . 611 . A shortest 
path or geodesic between two vertices is a path of minimal length, and the 
distance between two vertices is the length of a geodesic between these two 
vertices. The diameter of a graph is the maximal distance that can be found 
between two nodes. 

If no vertice is repeated, then the path is simple. If there exists a path 
between two vertices v and it, they are connected. The graph is also called a 
connected graph if for any pair of vertices v and it, there is, at least, one path 
connecting v and it. Otherwise, i.e. if there is some vertices which cannot 
be reached from other, then the graph is disconnected. If there is two nodes 
without path between them, then there is, at least two connected subgraphs, 
and a maximal connected subgraph is called a connected component. The edge 
connectivity of a graph G is the minimal number of nodes to be removed so 
that G is disconnected, and is denoted k(G). 

A cycle is a path such the first and the last node are equal, i.e. i>o = Vk- 
A graph without cycle is called a forest or an acyclic graph, and a connected 
forest is a tree. The edge connectivity k(T) of a tree T is equal to one, 
because, if the tree contains n vertices, it had n — 1 edges, and if any of these 
is removed, the tree is divided in two disconnected trees. 

A connected acyclic subgraph G' = (V',E') such V' = V, i.e. all the 
vertices of G are also in G' , is called a spanning tree. Every connected graph 
contains a least a spanning tree. For weighted graphs, a minimum spanning 
tree is the spanning tree such the sum of the weights of the edges is minimal. 
We can define a maximum spanning tree similarly. 

6.3 Community Definitions 

In the clustering framework a community is a cluster of nodes in a graph 1321 , 
but a very important question is what is a cluster? Even in the clustering 
literature there is non complete agreement between all authors (see uni for a 



c(C, V \ C) = | {{u, u} g E\u g C, v g V \ C} | 



( 6 . 5 ) 



{ v : I’ll, {vi,V2}, ■ ■ ■ , {Vk-l,u}. 



( 6 . 6 ) 
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review), but most of the time, the objects in a cluster must be more similar 
than objects out of this cluster: the objects are clustered or grouped based 
on the principle of maximizing the intra-class similarity and minimizing the 
inter-class similarity. Let us remark that, this definition implies the necessity 
to define the notions of similarity measure and/or cluster fitness measure. 

In the graph framework, we can agree that the goal of clustering is to divide 
the data set into clusters, such nodes of a cluster must be more connected 
with nodes of this cluster, than with nodes outside of the cluster (7 and jHj . 
It implies that it must exists at least a path between two nodes of a cluster, 
and this path must be internal to the cluster. 

Then the transposition of the above definition of a cluster into the graph 
context could be the following: the vertices are clustered or grouped based on 
the principle of maximizing the intra-class connections and minimizing the 
inter-class connections. 

There is several manner to quantify these internal and external connections 
of a cluster C. A first one is to divide the degree of a vertice in two parts: 
the internal degree degify , C) and the external degree deg e (v, C): 

de 9i {v,C) = \r{v)nC\ (6.7) 



deg e (v,C) = \r(v)n(V\C)\. ( 6 . 8 ) 

And, as r(v) = {/» n C} U {/» O (V \ C)} 

deg{y ) = degi(v, C) + deg e (v, C). (6-9) 



Of course, if deg e (v,C) = 0, then v G C is surely a good assignation for v, 
and conversely if degi(v, C) = 0, then we must have v ^ C . 

The internal and external degrees, can be seen as the “vertice ’s point of 
view” of indicators of the belonging to a cluster. For a “cluster’s point of 
view”, we must take a look at the notion of graph density. The intra-cluster 
density 5t(C ) and the inter-cluster density 8 e (C) are adapted versions of 
the density of a subgraph defined by expression (Iti.-IH . the first one being 
the quotient of the number of internal edges of C and the maximal possible 
number of internal nodes: 



, (r s _ C}\ 

j IC'KIC'I - l)/2 



( 6 . 10 ) 



and the second one being the result of the number of edges with one vertice 
inside C, and the other outside of C, divided by the maximal possible number 
of edges in this configuration: 



= |{u,u}|u € C,v j C}\ 
eK } IC'KIC'I - IC'D 



( 6 . 11 ) 



Of course for a given cluster C we expect Si(C) and 8 e (C) to be substantially, 
respectively, larger and smaller than the average density 8(G). And, in a 



122 



E. Cuvelier and M.-A. Aufaure 



“partition’s point of view”, the internal density of the partition, given by the 
sum of densities over all the clusters, must be appreciably higher than the 
density of the graph 6(G). 

The comparison of the density of inner ties versus the average density 
of the graph, leads to define a community in comparison to the rest of the 
graph, but a community can be considered independently of the graph as a 
whole. Local definitions of communities 0, focus only on the cohesion of the 
studied subgraph, including possibly its direct neighborhood, but ignoring 
the rest of the graph. In j2j Wasserman identify four criteria to define a 
cohesive subgroup: complete mutuality, reachability, nodal degree, internal 
versus external cohesion. 

The concept of complete mutuality states that, in a very strict sense, in a 
community, all member of a subgroup must be “friends” with all members of 
the subgroup. In graph theory, it corresponds to a clique H3- 

But the definition of a community as a clique is very too strict that is 
why relaxed definitions of the notion of clique leads to the reachability. An n- 
clique (or k-clique ) m is a maximal subgraph such, for any pair of vertices, 
there exists at least a geodesic no larger than n (or k). The classical clique 
is then a 1-clique. But a geodesic path of an ?r-clique could run outside of 
this latter, and then the diameter of the subgraph can exceed n. That is why 
the notion of n-club and n-clan was suggested m- An n-clan is an n- clique 
with diameter not larger than n, while an n-club is a maximal subgraph of 
diameter n. 

The use of the nodal degree to define a community imposes a constraint 
on the number of adjacent vertices. A k-plex DU is a maximal subgraph such 
each vertice is adjacent to all other vertices of the subgraph except for k of 
them. Conversely a k-core is a maximal subgraph such each vertice is adjacent 
to, at least, k other vertices of the subgraph | {TBj . 

Finally, comparing internal versus external cohesion , an LS-set m , or 
strong community [[20! is a subgraph C such for each node v £ C we have 
degi(v,C) > deg e (v,C). 

Another point of view to define a community is to say that the number of 
ties with the outside of the community must be low. That leads to use the 
notion cut size to define a cluster, and to try to minimize it. Rather than 
using directly the cut size, the conductance m *(co of a community C is 
defined to taking into account the order of the cluster and the outside of the 
cluster: 



$(C) 



c(C,V\C) 

min {deg(C),deg(V \ C)} 



( 6 . 12 ) 



where deg(C) and deg(V \ C ) are the total degrees of C and of the rest of 
the graph. The minimum of the conductance is obtained when a C commu- 
nity have a low cut size and when the total degree of the cluster and its 
complement are equal m- 

The cut size is sometime also called the external degree of a community C 
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deg e (C) = | {{w, i>} £ E\u £ S, v £ V \ 5} | 
while the internal degree is defined by 

degi(C) = | {{it, i;} £ E\u, v £ S} \ 



(6.13) 

(6.14) 



and then deg(C) = degi(C) + deg e {C ) . Internal and external degrees are used 
to define the relative density j'231 : 



P(C) 



degj(C ) 
deg(C) ' 



(6.15) 



A community with strong inner ties must have a higher p. 

Another idea to define a community, is given using the possible flow of 
information. In a group, given two nodes, at least one shortest path between 
these nodes, passing through the edges of the group must exist. Conversely 
if an edge is on many shortest path between several nodes, we can suppose 
that it is a connection edge between two communities. Finding the connect- 
ing edges permits to find the connected communities. It is the idea of the 
betweenness, and more precisely the edge (or site) betweenness introduced 
by P3 ■ The figure 16.21 gives an illustration of a node with a maximum edge 
betweenness. 

Finally, the notion of edge connectivity k is also used to define a com- 
munity: an highly connected subgraph (HCS) (23 is a subgraph C £ G such 



m > \ ■ 



(6.16) 



6.4 Measure for Clusters 

Most of the clustering algorithms are based on one or several measures, to 
be optimized and/or to be used to compare different cluster affectations. We 
are going to give here the most popular ones. 

If we can embed the graph into a n-dimensional Euclidean space, then we 
can use the classical distances like Euclidean distance, Manhattan distance 
or cosine similarity, but this embedding into a n-dimensional space can be 
seen as an artificial construction, and then a distance defined in such a space, 
used on a graph, is subject to the same criticism. It can be more suitable to 
work directly with informations included in the adjacency matrix, defining a 
distance based on this latter -b| > PI ; 

di,j = , 1^2 (a»,fe - a i,k ) 2 - (6-17) 

Y k^i,j 

This dissimilarity measures the structural equivalence PH: two vertices are 
structurally equivalent if they have the same neighbors, which is the case 
when dip = 0. 
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Edge Betweenness 




Fig. 6.2. An example of maximum edge betweenness (edge between nodes 2 
and 6) 



Another measure directly defined on the adjacency matrix is the Pearson 
correlation matrix computed on rows or columns of A: 






^2k= l( a i,k hi)( a j,k hj) 



( 6 . 18 ) 



with ^ = J2 k a-i,k/n and = -\/Sfc( a iA - Hk) 2 /n. 

Another popular seed to build (dis) similarity measure is the Jaccard index 
which measures similarity between sets: 



J(A,B) 



\AnB\ 

\AuB\' 



( 6 . 19 ) 



A first use of the Jaccard index in the graph theory context is to measure 
the overlap of the neighborhoods of two nodes v and u : 



cv(v, u ) 



\r(v)nr(u)\ 

\r(v)ur( u )\ 



( 6 . 20 ) 



which is equal to zero when there is no common neighbors, and one when v 
and u are structurally equivalent. 



6.5 Partitional Clustering 

Partitional algorithms try to find a partition of a set of data, with a given 
number of cluster equal to k. The “best” partition is searched using jointly, 
most of the times, a distance or dissimilarity measure and a quality criterion 
of the found partition. 



